(57) Abstract



A general purpose, programmable media processor (12) for processing and transmitting media data streams. The media processor
(12) incorporates an execution unit (100) that maintains substantially peak data throughput of media data streams. The execution unit
(100) includes a dynamically partitionable multi-precision arithmetic unit (102), programmable switch (104) and programmable extended
mathematical element (106). A high bandwidth external interface (124) supplies media data streams at substantially peak rates to a general
purpose register file (110) and the execution unit, a memory management unit, and instruction and data cache/buffers (118, 120). The
general purpose, programmable media processor (12) is disposed in a network fabric consisting of fiber optic cable, coaxial cable and
twisted pair wires to transmit, process and receive single or unified media data streams.
GENERAL PURPOSE, PROGRAMMABLE MEDIA PROCESSOR

Field of the Invention

This invention relates to the field of communications processing,
and more particularly, to a method and apparatus for real-time processing of
multi-media digital communications.

Background of the Invention 

Optical fiber and discs have made the transmission and storage of
digital information both cheaper and easier than older analog technologies. An
improved system for digital processing of media data streams is necessary in order
to realize the full potential of these advanced media.

For the past century, telephone service delivered over copper
twisted pair has been the lingua franca of communications. Over the next century,
broadband services delivered over optical fiber and coax will more completely
fulfill the human need for sensory information by supplying voice, video, and data
at rates of about 1,000 times greater than narrow band telephony. Current
general-purpose microprocessors and digital signal processors ("DSPs") can handle
digital voice, data, and images at narrow band rates, but they are far too slow for
processing media data at broadband rates.

This shortfall in digital processing of broadband media is currently
being addressed through the design of many different kinds of application-specific
integrated circuits ("ASICs"). For example, a prototypical broadband device
such as a cable modem modulates and demodulates digital data at rates up to 45
Mbits/sec within a single 6 MHz cable channel (as compared to rates of 28.8
Kbits/sec within a 6 KHz channel for telephone modems) and transcodes it onto a
10/100baseT connection to a personal computer ("PC") or workstation. Current
cable modems thus receive data from a coaxial cable connection through a chain of
specialized ASIC devices in order to accomplish Quadrature Amplitude
Modulation ("QAM") demodulation, Reed-Solomon error correction, packet
filtering, Data Encryption Standard ("DES") decryption, and Ethernet protocol
handling. The cable modems also transmit data to the coaxial cable link through a
second chain of devices to achieve DES encryption, Reed-Solomon block
encoding, and Quaternary Phase Shift Keying ("QPSK") modulation. In these
environments, a general-purpose processor is usually required as well in order to
perform initialization, statistics collection, diagnostics, and network management
functions.
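
As a rough check on the figures quoted above, the following Python sketch
compares the cable modem and telephone modem rates and the bits carried per
hertz of channel. The constant and function names are illustrative assumptions
and are not part of the original disclosure; only the four numeric values come
from the text.

```python
# Figures quoted in the text; the names are illustrative only.
CABLE_RATE_BPS = 45e6      # up to 45 Mbits/sec
CABLE_CHANNEL_HZ = 6e6     # single 6 MHz cable channel
PHONE_RATE_BPS = 28.8e3    # 28.8 Kbits/sec
PHONE_CHANNEL_HZ = 6e3     # 6 KHz channel, as stated in the text


def spectral_efficiency(rate_bps, channel_hz):
    """Return bits per second carried per hertz of channel bandwidth."""
    return rate_bps / channel_hz


# How many times faster the broadband link is than the telephone modem.
rate_ratio = CABLE_RATE_BPS / PHONE_RATE_BPS
cable_eff = spectral_efficiency(CABLE_RATE_BPS, CABLE_CHANNEL_HZ)
phone_eff = spectral_efficiency(PHONE_RATE_BPS, PHONE_CHANNEL_HZ)

print(f"rate ratio: {rate_ratio:.1f}x")
print(f"cable: {cable_eff:.1f} bit/s/Hz, telephone: {phone_eff:.1f} bit/s/Hz")
```

Under these figures the rate ratio is on the order of a thousand, which is
consistent with the "about 1,000 times" comparison made earlier, while the
per-hertz efficiency of the two links differs by less than a factor of two.
Most of the gap therefore comes from channel width rather than from denser
modulation.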

The ASIC approach to media processing has three fundamental
flaws: cost, complexity, and rigidity. The combined silicon area of all the
specialized ASIC devices required in the cable modem, for example, results in a
component cost incompatible with the per subscriber price target for a cable
service. The cable plant itself is a very hostile service environment, with noise
ingress, reflections, nonlinear amplifiers, and other channel impairments,
especially when viewed in the upstream direction. Telephony modems have
developed an elaborate hierarchy of algorithms implemented in DSP software, with
automatic reduction of data rates from 28.8Kbits/sec to 19.6Kbits/sec,
14.4Kbits/sec, or much lower rates as needed to accommodate noise, echoes and
other impairments in the copper plant. To implement similar algorithms on an
ASIC-based broadband modem is far more complex to achieve in software.

These problems of cost, complexity, and rigidity are compounded
further in more complete broadband devices such as digital set-top boxes,
multimedia PCs, or video conferencing equipment, all of which go beyond the
basic radio frequency ("RF") modem functions to include a broad range of audio
and video compression and decoding algorithms, along with remote control and
graphical user interfaces. Software for these devices must control what amounts to
a heterogeneous multi-processor, where each specialized processor has a different,
and usually eccentric or primitive, programming environment. Even if these
programming environments are mastered, the degree of programmability is
limited. For example, Motion Picture Expert Group-I ("MPEG-I") chips
manufactured by AT&T Corporation will not implement advances such as fractal-
and wavelet-based compression algorithms, and these chips are not readily software
upgradeable to the MPEG-II standard. A broadband network operator who leases
an MPEG ASIC-based product is therefore at risk of having to continuously
upgrade his system by purchasing significant amounts of new hardware just to
track the evolution of MPEG standards.

The high cost of ASIC-based media processing results from
inefficiencies in both memory and logic. A typical ASIC consists of a multiplicity
of specialized logic blocks, each with a small memory dedicated to holding the
data which comprises the working set for that block. The silicon area of these
multiple small memories is further increased by the overhead of multiple decoders,
sense amplifiers, write drivers, etc. required for each logic block. The logic
blocks are also constrained to operate at frequencies determined by the internal
symbol rates of broadband algorithms in order to avoid additional buffer
memories. These frequencies typically differ from the optimum speed-area
operating point of a given semiconductor technology. Interconnect and
synchronization of the many logic and memory blocks are also major sources of
overhead in the ASIC approach.

The disadvantages of the prior ASIC approach can be overcome by
a single unified media processor. The cost advantages of such a unified processor
can be achieved by gathering all the many ASIC functions of a broadband media
product into a single integrated circuit. Cost reduction is further increased by
reducing the total memory area of such a circuit by replacing the multiplicity of
small ASIC memories with a single memory hierarchy large enough to
accommodate the sum total of all the working sets, and wide enough to supply the
aggregate bandwidth needs of all the logic blocks. Additionally, the logic block
interconnect circuitry to this memory hierarchy may be streamlined by providing a
generally programmable switching fabric. Many of the logic blocks themselves
can also be replaced with a single multi-precision arithmetic unit, which can be
internally partitioned under software control to perform addition, multiplication,
division, and other integer and floating point arithmetic operations on symbol
streams of varying widths, while sustaining the full data throughput of the memory
hierarchy. The residue of logic blocks that perform operations that are neither
arithmetic nor permutation group oriented can be replaced with an extended math
unit that supports additional arithmetic operations such as finite field, ring, and
table lookup, while also sustaining the full data throughput of the memory
hierarchy.
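
One way to picture an arithmetic unit that is partitioned under software
control is a single wide word treated as independent lanes, with carries
blocked at lane boundaries. The Python sketch below is only an illustration
under stated assumptions: the function name, the default 128-bit word width
and the wrap-around behavior on overflow are hypothetical choices, not details
taken from the text.

```python
def partitioned_add(a, b, field_bits, word_bits=128):
    """Add two words lane by lane; a carry never crosses a lane boundary.

    field_bits is the width of each lane and must divide word_bits.
    The 128-bit default and the modular wrap-around are assumptions.
    """
    if field_bits < 1 or word_bits % field_bits != 0:
        raise ValueError("field width must divide the word width")
    lane_mask = (1 << field_bits) - 1
    result = 0
    for shift in range(0, word_bits, field_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Mask the sum so an overflow is dropped instead of spilling
        # into the neighbouring lane.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result


# The same 32-bit operands, treated as four 8-bit or two 16-bit symbols.
as_bytes = partitioned_add(0xFFFF00FF, 0x00010001, 8, word_bits=32)
as_halves = partitioned_add(0xFFFF00FF, 0x00010001, 16, word_bits=32)
print(hex(as_bytes), hex(as_halves))
```

The same operand bits give different results depending on the lane width,
which is the sense in which one datapath can serve symbol streams of varying
widths.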

The above multi-precision arithmetic, permutation switch, and
extended math operations can then be organized as machine instructions that
transfer their operands to and from a single wide multi-ported register file. These
instructions can be further supplemented with load/store instructions that transfer
register data to and from a data buffer/cache static random access memory
("SRAM") and main memory dynamic random access memories ("DRAMs"), and
with branch instructions that control the flow of instructions executed from an
instruction buffer/cache SRAM. Extensions to the load/store instructions can be
made for synchronization, and to branch instructions for protected gateways, so
that multiple threads of execution for audio, video, radio, encryption, networking,
etc. can efficiently and securely share memory and logic resources of a unified
machine operating near the optimum speed-area point of the target semiconductor
process. The data path for such a unified media processor can interface to a high
speed input/output ("I/O") subsystem that moves media streams across ultra-high
bandwidth interfaces to external storage and I/O.

Such a device would incorporate all of the processing capabilities of
the specialized multi-ASIC combination into a single, unified processing device.
The unified processor would be agile and capable of reprogramming through the
transmission of new programs over the communication medium. This
programmable, general purpose device is thus less costly than the specialized
processor combination, easier to operate and reprogram, and can be installed or
applied in many differing devices and situations. The device may also be scalable
to communications applications that support vast numbers of users through
massively parallel distributed computing.

It is therefore an object of this invention to process media data
streams by executing operations at very high bandwidth rates.

It is also an object of this invention to unify the audio, video, radio,
graphics, encryption, authentication, and networking protocols into a single
instruction stream.

It is also an object of this invention to achieve high bandwidth rates
in a unified processor that is easy to program and more flexible than a
heterogeneous combination of special purpose processors.

It is a further object of the invention to support high level
mathematical processing in a unified media processor, including finite group, finite
field, finite ring and table look-up operations, all at high bandwidth rates.

It is yet a further object of the invention to provide a unified media 
processor that can be replicated into a multi-processor system to support a vast 
array of users. 

It is yet another object of this invention to allow for massively
parallel systems within the switching fabric to support very large numbers of
subscribers and services.

It is also an object of the invention to provide a general purpose
programmable processor that could be employed at all points in a network.

It is a further object of this invention to sustain very high bandwidth
rates to arbitrarily large memory and input/output systems.

Summary of the Invention

In view of the above, there is provided a system for media
processing that maintains substantially peak data throughput in the execution and
transmission of multiple media data streams. The system includes in one aspect a
general purpose, programmable media processor, and in another aspect includes a
method for receiving, processing and transmitting media data streams. The
general purpose, programmable media processor of the invention further includes
an execution unit, high bandwidth external interface, and can be employed in a
parallel multi-processor system.

According to the apparatus of the invention, an execution unit is
provided that maintains substantially peak data throughput in the unified execution
of multiple media data streams. The execution unit includes a data path, and a
multi-precision arithmetic unit coupled to the data path and capable of dynamic
partitioning based on the elemental width of data received from the data path. The
execution unit also includes a switch coupled to the data path that is programmable
to manipulate data received from the data path and provide data streams to the
data path. An extended mathematical element is also provided, which is coupled
to the data path and programmable to implement additional mathematical
operations at substantially peak data throughput. In a preferred embodiment of the
execution unit, at least one register file is coupled to the data path.
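
The kind of data rearrangement a programmable switch performs can be sketched
with two simple permutations of a field sequence. The drawings list "deal" and
"shuffle" among the group operations, but this passage does not define them,
so the semantics below (shuffle as a perfect interleave of the two halves,
deal as its inverse) are an assumption made for illustration, as are the
function names and the use of Python lists to stand in for register fields.

```python
def shuffle(fields):
    """Interleave the first half of the fields with the second half.

    Assumed semantics: [a0, a1, b0, b1] -> [a0, b0, a1, b1].
    """
    if len(fields) % 2 != 0:
        raise ValueError("shuffle needs an even number of fields")
    half = len(fields) // 2
    out = []
    for low, high in zip(fields[:half], fields[half:]):
        out.extend((low, high))
    return out


def deal(fields):
    """Undo shuffle: even-indexed fields first, then odd-indexed fields."""
    if len(fields) % 2 != 0:
        raise ValueError("deal needs an even number of fields")
    return fields[0::2] + fields[1::2]


original = [0, 1, 2, 3, 4, 5, 6, 7]
mixed = shuffle(original)
print(mixed)
print(deal(mixed))
```

Operations of this shape are useful, for example, for interleaving two
channels of samples into one stream and separating them again; whether the
disclosed switch implements them in exactly this form is not stated here.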

According to another aspect of the invention, a general purpose
programmable media processor is provided having an instruction path and a data
path to digitally process a plurality of media data streams. The media processor
includes a high bandwidth external interface operable to receive a plurality of data
of various sizes from an external source and communicate the received data over
the data path at a rate that maintains substantially peak operation of the media
processor. At least one register file is included, which is configurable to receive
and store data from the data path and to communicate the stored data to the data
path. A multi-precision execution unit is coupled to the data path and is
dynamically configurable to partition data received from the data path to account
for the elemental symbol size of the plurality of media streams, and is
programmable to operate on the data to generate a unified symbol output to the
data path.

According to the preferred embodiment of the media processor,
means are included for moving data between registers and memory by performing
load and store operations, and for coordinating the sharing of data among a
plurality of tasks by performing synchronization operations based upon instructions
and data received by the execution unit. Means are also provided for securely
controlling the sequence of execution by performing branch and gateway
operations based upon instructions and data received by the execution unit. A
memory management unit operable to retrieve data and instructions for timely and
secure communication over the data path and instruction path respectively is also
preferably included in the media processor. The preferred embodiment also
includes a combined instruction cache and buffer that is dynamically allocated
between cache space and buffer space to ensure real-time execution of multiple
media instruction streams, and a combined data cache and buffer that is
dynamically allocated between cache space and buffer space to ensure real-time
response for multiple media data streams.

In another aspect of the invention, a high bandwidth processor
interface for receiving and transmitting a media stream is provided having a data
path operable to transmit media information at sustained peak rates. The high
bandwidth processor interface includes a plurality of memory controllers coupled
in series to communicate stored media information to and from the data path, and
a plurality of memory elements coupled in parallel to each of the plurality of
memory controllers for storing and retrieving the media information. In the
preferred embodiment of the high bandwidth processor interface, the plurality of
memory controllers each comprise a paired link disposed between each memory
controller, where the paired links each transmit and receive plural bits of data and
have differential data inputs and outputs and a differential clock signal.

Yet another aspect of the invention includes a system for unified
media processing having a plurality of general purpose media processors, where
each media processor is operable at substantially peak data rates and has a
dynamically partitioned execution unit and a high bandwidth interface for
communicating to memory and input/output elements to supply data to the media
processor at substantially peak rates. A bi-directional communication fabric is
provided, to which the plurality of media processors are coupled, to transmit and
receive at least one media stream comprising presentation, transmission, and
storage media information. The bi-directional communication fabric preferably
comprises a fiber optic network, and a subset of the plurality of media processors
comprise network servers.

According to yet another aspect of the invention, a parallel
multi-media processor system is provided having a data path and a high bandwidth
external interface coupled to the data path and operable to receive a plurality of
data of various sizes from an external source and communicate the received data at
a rate that maintains substantially peak operation of the parallel multi-processor
system. A plurality of register files, each having at least one register coupled to
the data path and operable to store data, are also included. At least one
multi-precision execution unit is coupled to the data path and is dynamically
configurable to partition data received from the data path to account for the
elemental symbol size of the plurality of media streams, and is programmable to
operate in parallel on data stored in the plurality of register files to generate a
unified symbol output for each register file.

According to the method of the invention, unified streams of media
data are processed by receiving a stream of unified media data including
presentation, transmission and storage information. The unified stream of media
data is dynamically partitioned into component fields of at least one bit based on
the elemental symbol size of data received. The unified stream of media data is
then processed at substantially peak operation.
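
Partitioning a word of media data into component fields can be sketched as
simple bit slicing. The following Python example is a minimal illustration
only; the function names, the least-significant-first field order and the
128-bit default word width are assumptions rather than details given in the
text.

```python
def partition(word, field_bits, word_bits=128):
    """Split a word into fields of field_bits each, least significant first.

    field_bits may be as small as one bit; it must divide word_bits.
    """
    if field_bits < 1 or word_bits % field_bits != 0:
        raise ValueError("field width must divide the word width")
    mask = (1 << field_bits) - 1
    return [(word >> shift) & mask
            for shift in range(0, word_bits, field_bits)]


def join(fields, field_bits):
    """Reassemble fields (least significant first) into a single word."""
    mask = (1 << field_bits) - 1
    word = 0
    for index, value in enumerate(fields):
        word |= (value & mask) << (index * field_bits)
    return word


nibbles = partition(0xABCD, 4, word_bits=16)
octets = partition(0xABCD, 8, word_bits=16)
bits = partition(0b101, 1, word_bits=3)
print(nibbles, octets, bits)
```

The same 16-bit value yields four 4-bit fields or two 8-bit fields depending
only on the symbol size chosen at run time, and `join` restores the original
word.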

In one aspect of the invention, the unified stream of media data is
processed by storing the stream of unified media data in a general register file.
Multi-precision arithmetic operations can then be performed on the stored stream
of unified media data based on programmed instructions, where the multi-precision
arithmetic operations include Boolean, integer and floating point mathematical
operations. The component fields of unified media data can then be manipulated
based on programmed instructions that implement copying, shifting and re-sizing
operations. Multi-precision mathematical operations can also be performed on the
stored stream of unified media data based on programmed instructions, where the
mathematical operations include finite group, finite field, finite ring and table
look-up operations. Instruction and data pre-fetching are included to fill
instruction and data pipelines, and memory management operations can be
performed to retrieve instructions and data from external memory. The
instructions and data are preferably stored in instruction and data cache/buffers, in
which buffer storage in the instruction and data cache/buffers is dynamically
allocated to ensure real-time execution.
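
To make the finite field and table look-up operations more concrete, the
Python sketch below multiplies two elements of the field GF(2^8), first by
shift-and-XOR and then through logarithm tables. It is an illustration under
stated assumptions: the text does not name a field size or reduction
polynomial, so the choice of GF(2^8) with the polynomial 0x11D (a common
choice in Reed-Solomon coding) and all function and table names are
hypothetical.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1; an assumed reduction polynomial


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-XOR with reduction."""
    product = 0
    while b:
        if b & 1:
            product ^= a      # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= POLY         # reduce back into 8 bits
        b >>= 1
    return product


# Build log/antilog tables from successive powers of the element 2.
EXP = [0] * 255
LOG = [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value = gf_mul(value, 2)


def gf_mul_table(a, b):
    """Multiply via table look-up: add logarithms modulo 255."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 255]


print(hex(gf_mul(2, 0x80)), hex(gf_mul_table(2, 0x80)))
```

The table version trades a small memory for a fixed number of look-ups per
product, which is one reason table look-up and finite field arithmetic tend to
appear together in error-correction code.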

Other aspects of the invention include a method for achieving high
bandwidth communications between a general purpose media processor and
external devices by providing a high bandwidth interface disposed between the
media processor and the external devices, in which the high bandwidth interface
comprises at least one uni-directional channel pair having an input port and an
output port. A plurality of media data streams, comprising component fields of
various sizes, are transmitted and received between the media processor and the
external devices at a rate that sustains substantially peak data throughput at the
media processor. A method for processing streams of media data is also included
that provides a bi-directional communications fabric for transmitting and receiving
at least one stream of media data, where the at least one stream of media data
comprises presentation, transmission and storage information. At least one
programmable media processor is provided within the communications network for
receiving, processing and transmitting the at least one stream of unified media data
over the bi-directional communications fabric.

The general purpose, programmable media processor of the
invention combines in a single device all of the necessary hardware included in the
specialized processor combinations to process and communicate digital media data
streams in real-time. The general purpose, programmable media processor is
therefore cheaper and more flexible than the prior approach to media processing.
The general purpose, programmable media processor is thus more susceptible to
incorporation within a massively parallel processing network of general purpose
media processors that enhance the ability to provide real-time multi-media
communications to the masses.

These features are accomplished by deploying server media
processors and client media processors throughout the network. Such a network
provides a seamless, global media super-computer which allows programmers and
network owners to virtualize resources. Rather than restrictively accessing only
the memory space and processing time of a local resource, the system allows
access to resources throughout the network. In small access points such as
wireless devices, where very little memory and processing logic is available due to
limited battery life, the system is able to draw upon the resources of a
homogeneous multi-computer system.

The invention also allows network owners the facility to track
standards and to deploy new services by broadcasting software across the network
rather than by instituting costly hardware upgrades across the whole network.
Broadcasting software across the network can be performed at the end of an
advertisement or other program that is broadcast nationally. Thus, services can
be advertised and then transmitted to new subscribers at the end of the
advertisement.

These and other features and advantages of the invention will be
apparent upon consideration of the following detailed description of the presently
preferred embodiments of the invention, taken in conjunction with the appended
drawings.

Brief Description of the Drawings

FIG. 1 is a block diagram of a broad band media computer
employing the general purpose, programmable media processor of the invention;

FIG. 2 is a block diagram of a global media processor employing
multiple general purpose media processors according to the invention;

FIG. 3 is an illustration of the digital bandwidth spectrum for
telecommunications, media and computing communications;

FIG. 4 is the digital bandwidth spectrum shown in FIG. 3 taking
into account the bandwidth overhead associated with compressed video techniques;

FIG. 5 is a block diagram of the current specialized processor
solution for mass media communication, where FIG. 5(a) shows the current
distributed system, and FIG. 5(b) shows a possible integrated approach;

FIG. 6 is a block diagram of two presently preferred general
purpose media processors, where FIG. 6(a) shows a distributed system and FIG.
6(b) shows an integrated media processor;

FIG. 7 is a block diagram of the presently preferred structure of a
general purpose, programmable media processor according to the invention;

FIG. 8 is a drawing consisting of visual illustrations of the various
group operations provided on the media processor, where FIG. 8(a) illustrates the
group expand operation, FIG. 8(b) illustrates the group compress or extract
operation, FIG. 8(c) illustrates the group deal and shuffle operations, FIG. 8(d)
illustrates the group swizzle operation and FIG. 8(e) illustrates the various group
permute operations;

FIG. 9 shows the preferred instruction and data sizes for the general
purpose, programmable media processor, where FIG. 9(a) is an illustration of the
various instruction formats available on the general purpose, programmable media
processor, FIG. 9(b) illustrates the various floating-point data sizes available on
the general purpose media processor, and FIG. 9(c) illustrates the various fixed-
point data sizes available on the general purpose media processor;

FIG. 10 is an illustration of a presently preferred memory
management unit included in the general purpose processor shown in FIG. 7,
where FIG. 10(a) is a translation block diagram and FIG. 10(b) illustrates the
functional blocks of the translation lookaside buffer;

FIG. 11 is an illustration of a super-string pipeline technique;

FIG. 12 is an illustration of the presently preferred super-spring
pipeline technique;

FIG. 13 is a block diagram of a single memory channel for
communication to the general purpose media processor shown in FIG. 7;

FIG. 14 is an illustration of the presently preferred connection of
standard memory devices to the preferred memory interface;

FIG. 15 is a block diagram of the input/output controller for use
with the memory channel shown in FIG. 13;

FIG. 16 is a block diagram showing multiple memory channels
connected to the general purpose media processor shown in FIG. 1, where FIG.
16(a) shows a two-channel implementation and FIG. 16(b) illustrates a twelve-
channel embodiment;

FIG. 17 illustrates the presently preferred packet communications
protocol for use over the memory channel shown in FIG. 13;

FIG. 18 shows a multi-processor configuration employing the
general purpose media processor shown in FIG. 7, where FIG. 18(a) shows a
linear processor configuration, FIG. 18(b) shows a processor ring configuration,
and FIG. 18(c) shows a two-dimensional processor configuration; and

FIG. 19 shows a presently preferred multi-chip implementation of
the general purpose, programmable media processor of the invention.

Detailed Description of the Presently Preferred Embodiments

Referring to the drawings, where like reference numerals refer to
like elements throughout, a broad band microcomputer 10 is provided in FIG. 1.
The broad band microcomputer 10 consists essentially of a general purpose media
processor 12. As will be described in more detail below, the general purpose
media processor 12 receives, processes and transmits media data streams in a bi-
directional manner from upstream network components to downstream devices. In
general, media data streams received from upstream network components can
comprise any combination of audio, video, radio, graphics, encryption,
authentication, and networking information. As those skilled in the art will
appreciate, however, the general purpose media processor 12 is in no way limited
to receiving, processing and transmitting only these types of media information.
The general purpose media processor 12 of the invention is capable of processing
any form of digital media information without departing from the spirit and
essential scope of the invention.

System Configuration

In the preferred embodiment of the invention shown in FIG. 1,
media data streams are communicated to the media processor 12 from several
sources. Ideally, unified media data streams are received and transmitted by the
general purpose media processor 12 over a fiber optic cable network 14. As will
be described in more detail below, although a fiber optic cable network is
preferred, the presently existing communications network in the United States
consists of a combination of fiber optic cable, coaxial cable and other transmission
media. Consequently, the general purpose media processor 12 can also receive
and transmit media data streams over coaxial cable 14 and traditional twisted pair
wire connections 16. The specific communications protocol employed over the
twisted pair 16, whether POTS, ISDN or ADSL, is not essential; all protocols are
supported by the broad band microcomputer 10. The details of these protocols are
generally known to those skilled in the art and no further discussion is therefore
needed or provided herein.

Another form of upstream network communication is through a
satellite link 18. The satellite link 18 is typically connected to a satellite receiver
20. The satellite receiver 20 comprises an antenna, usually in the form of a
satellite dish, and amplification circuitry. The details of such satellite
communications are also generally known in the art, and further detail is therefore
not provided or included herein.

As described above, the general purpose media processor 12
communicates in a bi-directional manner to receive, process and transmit media
data streams to and from downstream devices. As shown in FIG. 1, downstream
communication preferably takes place in at least two forms. First, media data
streams can be communicated over a bi-directional local network 22. Various
types of local networks 22 are generally known in the art and many different
forms exist. The general purpose media processor 12 is capable of communicating
over any of these local networks 22 and the particular type of network selected is
implementation specific.

The local network 22 is preferably employed to communicate
between the unified processor 12, and audio/visual devices 24 or other digital
devices 26. Presently preferred examples of audio/visual devices 24 include
digital cable television, video-on-demand devices, electronic yellow pages services,
integrated message systems, video telephones, video games and electronic program
guides. As those skilled in the art will appreciate, other forms of audio/video
devices are contemplated within the spirit and scope of the invention. Presently
preferred embodiments of other digital devices 26 for communication with the
general purpose media processor 12 include personal computers, television sets,
work stations, digital video camera recorders, and compact disc read-only
memories. As those skilled in the art will also appreciate, further digital devices
26 are contemplated for communication to the general purpose media processor 12
without departing from the spirit and scope of the invention.

Second, the general purpose media processor preferably also
communicates with downstream devices over a wireless network 28. In the
presently preferred embodiment of the invention, wireless devices for
communication over the wireless network 28 can comprise either remote
communication devices 30 or remote computing devices 32. Presently preferred
embodiments of the remote communications devices 30 include cordless telephones
and personal communicators. Presently preferred embodiments of the remote
computing devices 32 include remote controls and telecommunicating devices. As
those skilled in the art will appreciate, other forms of remote communication
devices 30 and remote computing devices 32 are capable of communication with
the general purpose media processor 12 without departing from the spirit and
scope of the invention. An agile digital radio (not shown) that incorporates a

Network Configuration

Referring now to FIG. 2, the general purpose media processor 12 is
preferably disposed throughout a digital communications network 38. In order to
enable communication among large and small businesses, residential customers and
mobile users, the network 38 can consist of a combination of many individual sub-
networks comprised of three main forms of interconnection. The trunk and main
branches of the network 38 preferably employ fiber optic cable 40 as the preferred
means of interconnection. Fiber optic cable 40 is used to connect between general
purpose media processors 12 disposed as network servers 46 or large business
installations 48 that are capable of coupling directly to the fiber optic link 40. For
communications to small business and residential customers that may be incapable
of directly coupling to the fiber optic cable 40, a general purpose media processor
12 can be used as an interface to other forms of network interconnection.

As shown in FIG. 2, alternate forms of interconnection consist of
coaxial cable lines 42 and twisted pair wiring 44. Coaxial cable lines are currently
in place throughout the U.S. and are typically employed to provide cable television
services to residential homes. According to the preferred embodiment of the
invention, general purpose media processors 12 can be installed at these residential
locations 52. In contrast to the specialized processor approach, the general
purpose media processor 12 provides enough bandwidth to allow for bi-directional 
communications to and from these residential locations 52. 

Network servers 46 controlled by general purpose media processors
12 are also employed throughout the network 38. For example, the network
servers 46 can be used to interface between the fiber optic network 40 and twisted
pair wiring 44. Twisted pair wiring 44 is still employed for small businesses 50
and residential locations 52 that do not or cannot currently subscribe to coaxial
cable or fiber optic network services. General purpose media processors 12 are
also disposed at these small business locations 50 and non-cable residential
locations 52. General purpose media processors 12 are also installed in wireless
or mobile locations 52, which are coupled to the network 38 through agile digital
radios (not shown). As shown in FIG. 2, network databases or other peripherals
56 can also be coupled to general purpose media processors 12 in the network 38.

The general purpose media processor 12 is operable at significantly
high bandwidths in order to receive, process and transmit unified media data
streams. Referring to FIG. 3, the respective frequencies for various types of
media data streams are set forth against a bandwidth spectrum 60. The bandwidth
spectrum 60 includes three component spectrums, all along the same range of
frequencies, which represent the various frequency rates of digital media
communications. Current computing bandwidth capabilities are also displayed.
The telecommunications spectrum 62 shows the various frequency bands used for
telecommunications transmission. For example, teletype terminals and modems
operate in a range from approximately 64 bits/second to 16 kilobits/second.
The ISDN telecommunication protocol operates at 64 kilobits/second. At the
upper end of the telecommunications spectrum 62, T1 and T3 trunks operate at
one megabit per second and 32 megabits per second respectively. The SONET
frequency range extends from approximately 128 megabits per second up to
approximately 32 gigabits per second. Accordingly, in order to carry such
broadband communications, the general purpose media processor 12 is capable of
transferring information at rates into the gigabits per second range or higher.

A spectrum of typical media data streams is presented in the media 
spectrum 64 shown in FIG. 3. Voice and music transmissions are centered at 
frequencies of approximately 64 kilobits per second and one megabit per second, 
respectively. At the upper end of the media spectrum 64, video transmission takes 
place in a range from 128 megabits per second for high density television up to 
over 256 gigabits per second for movie applications. When using common video
compression techniques, however, the video transmission spectrum can be shifted
down to between 32 kilobits per second and 128 megabits per second as a result of
the data compression. As described below, the processing required to achieve the
data compression results in an increase in bandwidth requirements. 

Current computing bandwidths are shown in the computing spectrum
66 of FIG. 3. Serial communications presently take place in a range from two
kilobits per second up to 512 kilobits per second. The Ethernet network protocol
operates at approximately 8 megabits per second. Current dynamic random access
memory and other digital input/output peripherals operate between 32 megabits per
second and 512 megabits per second. Presently available microprocessors are
capable of operation in the low gigabits per second range. For example, the x86
Pentium microprocessor manufactured by Intel Corporation of Santa Clara,
California operates in the lower half of that range, and the Alpha microprocessor
manufactured by Digital Equipment Corporation approaches the 16 gigabits per
second range.

When video compression is employed, as expressed above, the
associated processing overhead reduces the effective bandwidth of the particular
processor. As a result, in order to handle compressed video, these processors
must operate in the terahertz frequency range. The bandwidth spectrum 60 shown
in FIG. 4 represents the effect of handling media data streams including
compressed video. The computing spectrum 66 is skewed down to properly align
the computing bandwidth requirements with the telecommunications spectrum 62
and the media spectrum 64. Accordingly, current processor technology is not
sufficient to handle the transmission and processing associated with complex
streams of multi-media data.

The current specialized processor approach to media processing is
illustrated in the block diagram shown in FIG. 5. As shown in FIG. 5, special
purpose processors are coupled to a back plane 70, which is capable of
transmitting instructions and data at the upper kilobits to lower gigabits per second
range. In a typical configuration, an audio processor 76, video processor 78,
graphics processor 80 and network processor 82 are all coupled to the back plane
70. Each of the audio, video, graphics and network processors 76-82 typically
employs its own private or dedicated memory 84, which is only accessible to
the specific processor and not accessible over the back plane 70. As described
above, however, unless video data streams are constantly being processed, for
example, the video processor 78 will sit idle for periods of time. The computing
power of the dedicated video processor 78 is thus only available to handle video
data streams and is not available to handle other media data streams that are
directed to other dedicated processors. This, of course, is an inefficient use of the
video processor 78, particularly in view of the overall processing capability of this
multi-processor system.

The general purpose media processor 12, in contrast, handles a data
stream of audio, video, graphics and network information all at the same time with
the same processor. In order to handle the ever changing combination of data
types, the general purpose media processor 12 is dynamically partitionable to
allocate the appropriate amount of processing for each combination of media in a
unified media data stream. A block diagram of two preferred general purpose
media processor system configurations is shown in FIG. 6. Referring to FIG.
6(a), a general purpose media processor 12 is coupled to a high-speed back plane
90. The presently preferred back plane 90 is capable of operation at 30 gigabits
per second. As those skilled in the art will appreciate, back planes 90 that are
capable of operation at 400 gigabits per second or greater bandwidth are
envisioned within the spirit and scope of the invention. Multiple memory devices
92, which are accessible by the general purpose media processor 12, are also
coupled to the back plane 90. Input/output devices 94 are coupled to the back
plane 90 through a dual-ported memory 92. The configuration of the input/output
devices 94 on one end of the dual-ported memory 92 allows the sharing of these
memory devices 92 throughout a network 38 of general purpose media processors
12.

Alternatively, FIG. 6(b) shows a presently preferred integrated
general purpose media processor 12. The integrated processor includes on-board
memory and I/O 86. The on-board memory is preferably of sufficient size to
optimize throughput, and can comprise a cache and/or buffer memory or the like. 
The integrated media processor 12 also connects to external memory 88, which is 
preferably larger than the on-board memory 86 and forms the system main 
memory. 

Execution Unit

One presently preferred embodiment of an integrated general
purpose media processor 12 is shown in FIG. 7. The core of the integrated
general purpose media processor 12 comprises an execution unit 100. Three main
elements or subsections are included in the execution unit 100. A multiple
precision arithmetic/logic unit ("ALU") 102 performs all logical and simple
arithmetic operations on incoming media data streams. Such operations consist of
calculate and control operations such as Boolean functions, as well as addition,
subtraction, multiplication and division. These operations are performed on single
or unified media data streams transmitted to and from the multiple precision ALU
102 over a data bus or data path 108. Preferably the data path 108 is 128 bits
wide, although those skilled in the art will appreciate that the data path 108 can
take on any width or size without departing from the spirit and scope of the
invention. The wider the data path 108, the more unified media data can be
processed in parallel by the general purpose media processor 12.
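
The relationship between data path width and parallelism can be illustrated with a minimal sketch. It assumes, purely for illustration, that a 128-bit word is treated as independent lanes of equal size and that carries are discarded at lane boundaries. The name `group_add` and the wraparound behavior are hypothetical and are not taken from the instruction set in the Microfiche Appendix.

```python
def group_add(a: int, b: int, lane_bits: int, width: int = 128) -> int:
    """Add two `width`-bit words lane by lane (illustrative model only).

    Assumption: each lane wraps around on overflow and no carry crosses
    a lane boundary. The source does not specify this behavior.
    """
    if lane_bits <= 0 or width % lane_bits:
        raise ValueError("lane size must evenly divide the data path width")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift  # drop the carry out of the lane
    return result


# The same 128-bit path holds sixteen 8-bit lanes or eight 16-bit lanes.
as_bytes = group_add(0x00FF, 0x0001, lane_bits=8)    # low lane wraps to 0x00
as_halves = group_add(0x00FF, 0x0001, lane_bits=16)  # carry stays inside the lane
```

With 8-bit lanes a 128-bit path carries sixteen independent additions per operation, which is the sense in which a wider path processes more media data in parallel.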

Coupled to the multi-precision ALU 102 via the data path 108, and
also an element of the execution unit 100, is a programmable switch 104. The
programmable switch 104 performs data handling operations on single or unified
media data streams transmitted over the data path 108. Examples of such data
handling operations include deals, shuffles, shifts, expands, compresses, swizzles,
permutes and reverses, although other data handling operations are contemplated.
These operations can be performed on single bits or bit fields consisting of two or
more bits up to the entire width of the data path 108. Thus, single bits or bit
fields of various sizes can be manipulated through programmable operation of the
switch 104.

Examples of the presently preferred data manipulation operations
performed by the general purpose media processor 12 are shown in FIG. 8. A
group expand operation is visually illustrated in FIG. 8(a). According to the
group expand operation, a sequential field of bits 270 can be divided into
constituent sub-fields 272a-272d for insertion into a larger field array 274. The
reverse of the group expand operation is a group compress or extract operation. A
visual illustration of the group compress or extract operation is shown in FIG.
8(b). As shown, separate sub-fields 272a-272d from a larger bit field 274 can be
combined to form a contiguous or sequential field of bits 270.
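
The expand and compress operations described above can be sketched on ordinary integers. This is an illustrative model only: the function names, the argument order and the choice to place each sub-field in the low bits of its slot are assumptions, since FIG. 8 is not reproduced here.

```python
def group_expand(packed: int, n_fields: int, field_bits: int, slot_bits: int) -> int:
    """Spread `n_fields` contiguous sub-fields into wider slots.

    Assumption: each sub-field lands in the low-order bits of its slot
    and the remaining slot bits are zero.
    """
    if slot_bits < field_bits:
        raise ValueError("slots must be at least as wide as the sub-fields")
    field_mask = (1 << field_bits) - 1
    out = 0
    for i in range(n_fields):
        sub_field = (packed >> (i * field_bits)) & field_mask
        out |= sub_field << (i * slot_bits)
    return out


def group_compress(spread: int, n_fields: int, field_bits: int, slot_bits: int) -> int:
    """Inverse of group_expand: gather sub-fields back into a contiguous field."""
    if slot_bits < field_bits:
        raise ValueError("slots must be at least as wide as the sub-fields")
    field_mask = (1 << field_bits) - 1
    out = 0
    for i in range(n_fields):
        sub_field = (spread >> (i * slot_bits)) & field_mask
        out |= sub_field << (i * field_bits)
    return out


# Four 4-bit sub-fields expanded into four 8-bit slots, then compressed back.
expanded = group_expand(0xABCD, n_fields=4, field_bits=4, slot_bits=8)
restored = group_compress(expanded, n_fields=4, field_bits=4, slot_bits=8)
```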

Referring to FIGS. 8(c)-8(e), group deal, shuffle, swizzle and
permute operations performed by the programmable switch 104 are also
illustrated. The operations performed by these instructions are readily understood
from a review of the drawings. The group manipulation operations illustrated in
FIGS. 8(a)-8(e) comprise the presently contemplated data manipulation operations
for the general purpose media processor 12. As those skilled in the art will
appreciate, either a subset of these operations or additional data manipulation
operations can be incorporated in other alternate embodiments of the general 
purpose media processor 12 without departing from the spirit and scope of the 
invention. 

Referring again to FIG. 7, higher level mathematical operations than
those performed by the multi-precision ALU 102 are performed in the general
purpose media processor 12 through an extended math element 106. The extended
math element 106 is coupled to the data path 108 and also comprises part of the
execution unit 100. The extended math element 106 performs the complex
arithmetic operations necessary for video data compression and similarly intensive
mathematical operations. One presently preferred example of an extended math
operation comprises a Galois field operation. Other examples of extended
mathematical functions performed by the extended math element 106 include CRC
generation and checking, Reed-Solomon code generation and checking, and
spread-spectrum encoding and decoding. As those skilled in the art appreciate,
additional mathematical operations are possible and contemplated.
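
As one illustration of a Galois field operation, the sketch below multiplies two elements of GF(2^8) by shift-and-XOR with modular reduction. The field size and the reduction polynomial 0x11B are assumptions chosen because they are widely documented; the source does not state which fields or polynomials the extended math element 106 supports.

```python
def gf256_mul(a: int, b: int, poly: int = 0x11B) -> int:
    """Multiply two GF(2^8) elements (illustrative, not the patented circuit).

    Addition in GF(2^n) is XOR, so the product is built by XOR-ing shifted
    copies of `a` and reducing by `poly` whenever the degree reaches 8.
    """
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must be 8-bit field elements")
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= poly        # reduce modulo the field polynomial
        b >>= 1
    return result


product = gf256_mul(0x57, 0x83)
```

Reed-Solomon encoders and decoders are built from repeated multiplications of this kind, which is why a dedicated element for them is useful alongside the ALU.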

According to the preferred embodiment of the integrated general
purpose media processor 12, a register file 110 is provided in addition to the
execution unit 100 to process media data. The register file 110 stores and
transmits data streams to and from the execution unit 100 via the data path 108.
Rather than employing a complex set of specific or dedicated registers, the general
purpose media processor 12 preferably includes 64 general purpose registers in the
register file 110 along with one program counter (not shown). The 64 general
purpose registers contained in the register file 110 are all available to the
user/programmer, and comprise a portion of the user state of the general purpose
media processor 12. The general purpose registers are preferably capable of
storing any form of data. Each register within the register file 110 is coupled to
the data path 108 and is accessible to the execution unit 100 in the same manner.
Thus, the user can employ a general purpose register according to the specific
needs of a particular program or unique application. As those skilled in the art
will appreciate, the register file 110 can also comprise a plurality of register files
110 configured in parallel in order to support parallel multi-threaded processing.



Instruction Set and User Programming

Control or manipulation of data processed by the general purpose
media processor 12 is achieved by selected instructions programmed by the user.
Those skilled in the art will appreciate that a great number of programs are
possible through various sequences of instructions. Particular programs can be
developed for each unique implementation of the general purpose media processor
12. A detailed discussion of such specific programs is therefore beyond the scope
of this description.

One presently preferred instruction set for the general purpose 
media processor 12 is included in the Microfiche Appendix, the contents of which
are hereby incorporated herein by reference. A list of the presently preferred 
major operation codes for the general purpose media processor 12 appears below 
in Table I. 

MAJOR OPERATION CODES

TABLE I

As shown in Table I, the major operation codes are grouped according to the
function performed by the operations. The operations are thus arranged and listed
above according to the presently preferred operation code number for each
instruction. As many as 255 separate operations are contemplated for the
preferred embodiment of the general purpose media processor 12. As shown in
Table I, however, not all of the operation codes are presently implemented. As
those skilled in the art will appreciate, alternate schemes for organizing the
operation codes, as well as additional operation codes for the general purpose
media processor 12, are possible.

The instructions provided in the instruction set for the general
purpose media processor 12 control the transfer, processing and manipulation of
data streams between the register file 110 and the execution unit 100. The
presently preferred width of the instruction path 112 is 32 bits, organized as
four eight-bit bytes ("quadlets"). Those skilled in the art will appreciate,
however, that the instruction path 112 can take on any width without departing
from the spirit and scope of the invention. Preferably, each instruction within the
instruction set is stored or organized in memory on four-byte boundaries. The
presently preferred format for instructions is shown in FIG. 9(a).

As shown in FIG. 9(a), each of the presently preferred instruction
formats for the general purpose media processor 12 includes a field 280 for the
major operation code number shown in Table I. Based on the type of operation
performed, the remaining bits can provide additional operands according to the
type of addressing employed with the operation. For example, the remainder of
the 32-bit instruction field can comprise an immediate operand ("imm"), or
operands stored in any of the general registers ("ra," "rb," "rc," and "rd"). In
addition, minor operation codes 282 can also be included among the operands of
certain 32-bit instruction formats.
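
A four-register instruction word of this kind could be packed and unpacked as sketched below. The field positions are assumptions, not taken from FIG. 9(a): an 8-bit major operation code in the top byte (consistent with up to 255 operation codes) followed by four 6-bit register numbers (consistent with 64 general purpose registers), which together fill exactly 32 bits. The names `encode` and `decode` are hypothetical.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    major: int  # major operation code, assumed to occupy bits 31..24
    ra: int     # register fields, assumed 6 bits each
    rb: int
    rc: int
    rd: int


def encode(major: int, ra: int, rb: int, rc: int, rd: int) -> int:
    """Pack the assumed fields into one 32-bit instruction word."""
    if not 0 <= major < 256:
        raise ValueError("major operation code must fit in 8 bits")
    for reg in (ra, rb, rc, rd):
        if not 0 <= reg < 64:
            raise ValueError("register number must be in 0..63")
    return (major << 24) | (ra << 18) | (rb << 12) | (rc << 6) | rd


def decode(word: int) -> Decoded:
    """Split a 32-bit instruction word into the assumed fields."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction words are 32 bits wide")
    return Decoded(
        major=(word >> 24) & 0xFF,
        ra=(word >> 18) & 0x3F,
        rb=(word >> 12) & 0x3F,
        rc=(word >> 6) & 0x3F,
        rd=word & 0x3F,
    )


word = encode(0x2A, 1, 2, 3, 63)
fields = decode(word)
```

Formats that carry an immediate operand or a minor operation code would reuse some of the register bit positions; those layouts are not modeled here.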

The presently preferred embodiment of the general purpose media
processor 12 includes a limited instruction set similar to those seen in Reduced
Instruction Set Computer ("RISC") systems. The preferred instruction set for the
general purpose media processor 12 shown in Table I includes operations which
implement load, store, synchronize, branch and gateway functions. These five
groups of operations can be visually represented as two general classes of related
operations. The branch and gateway operations perform related functions on
media data streams and are thus visually represented as block 114 in FIG. 7.
Similarly, the load, store and synchronize operations are grouped together in block
116 and perform similar operations on the media data streams. (Blocks 114 and
116 only represent the above classification of these operations and their function in
the processing of media data streams, and do not indicate any specific underlying
electronic connections.) A more detailed discussion of these operations, and the
functionality of the general purpose media processor 12, appears in the Microfiche
Appendix.

The four-byte structure of instructions for the general purpose media
processor 12 is preferably independent of the byte ordering used for any data
structures. Nevertheless, the gateway instructions are specifically defined as
16-byte structures containing a code address used to securely invoke a procedure at a
higher privilege level. Gateways are preferably marked by protection information
specified in the translation lookaside buffer 148 in the memory management unit
122. Gateways are thus preferably aligned on 16-byte boundaries in the external
memory. In addition to the general purpose registers and program counter, a
privilege level register is provided within the register file 110 that contains the
privilege level of the currently executing instruction.

The instruction set preferably includes load and store instructions
that move data between memory and the register file 110, branch instructions to
compare the contents of registers and transfer control, and arithmetic operations to
perform computations on the contents of registers. Swap instructions provide
multi-thread and multi-processor synchronization. These operations are preferably
indivisible and include such instructions as add-and-swap, compare-and-swap, and
multiplex-and-swap instructions. The fixed-point compare-and-branch instructions
within the instruction set shown in Table I provide the necessary arithmetic tests
for equality and inequality of signed and unsigned fixed-point values. The branch
through gateway instruction provides a secure means to access code at a higher
privilege level in a form similar to a high level language procedure call generally
known in the art.
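
The role of an indivisible compare-and-swap in multi-thread synchronization can be sketched in software. The sketch is a generic model rather than the instruction defined in the Microfiche Appendix: the class `Cell`, its methods and the use of a lock to stand in for hardware indivisibility are all assumptions made for illustration.

```python
import threading


class Cell:
    """A memory word with a software-emulated compare-and-swap."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware indivisibility

    def load(self) -> int:
        with self._lock:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Store `new` only if the word still holds `expected`.

        Returns the value observed before the operation, so the caller
        can tell whether the swap took effect.
        """
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def atomic_increment(cell: Cell) -> int:
    """Lock-free style retry loop built on compare-and-swap."""
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1) == seen:
            return seen + 1


def run_workers(cell: Cell, n_threads: int, n_increments: int) -> int:
    """Increment the shared cell from several threads and return the total."""
    def work() -> None:
        for _ in range(n_increments):
            atomic_increment(cell)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()
```

Because a failed swap is detected and retried, no increment is lost even when several threads update the same word concurrently.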

The general purpose media processor 12 also preferably supports
floating-point compare-and-branch instructions. The arithmetic operations, which
are supported in hardware, include floating-point addition, subtraction,
multiplication, division and square root. The general purpose media processor 12
preferably supports other floating-point operations defined by the ANSI-IEEE
floating-point standard through the use of software libraries. A floating-point
value can preferably be 16, 32, 64 or 128 bits wide. Examples of the presently
preferred floating-point data sizes are illustrated in FIG. 9(b).

The general purpose media processor 12 preferably supports virtual
memory addressing and virtual machine operation through a memory management
unit 122. Referring to FIG. 10(a), one presently preferred embodiment of the
memory management unit 122 is shown. The memory management unit 122
preferably translates global virtual addresses into physical addresses by software
programmable routines augmented by a hardware translation lookaside buffer
("TLB") 148. A facility for local virtual address translation 164 is also preferably
provided. As those skilled in the art will appreciate, the memory management
unit 122 includes a data cache 166 and a tag cache 168 that store data and tags
associated with memory sections for each entry in the TLB 148.

A block diagram of one preferred embodiment of the TLB 148 is
shown in FIG. 10(b). The TLB 148 receives a virtual address 230 as its input.
For each entry in the TLB 148, the virtual address 230 is logically AND-ed with a
mask 232. The output of each respective AND gate 234 is compared via a
comparator 236 with each entry in the TLB 148. If a match is detected, an output
from the comparator 236 is used to gate data 240 through a transceiver 238. As
those skilled in the art will appreciate, a match indicates the entry of the
corresponding physical address within the contents of the TLB 148 and no external
memory or I/O access is required. The data 240 for the data cache 166 (FIG.
10(a)) is then combined with the remaining lower bits of the virtual address 230
through an exclusive-OR gate 242. The resultant combination is the physical
address 244 output from the TLB 148. If a match is not detected between the
logical address and the contents of the tag cache 168, the memory management
unit 122 requires an external memory or I/O access to retrieve the relevant
portion of memory and update the contents of the TLB 148 accordingly.
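
The mask, compare and exclusive-OR steps described for the TLB 148 can be sketched as follows. This is a software model under stated assumptions: the entry layout (`mask`, `tag`, `data`), the names `TlbEntry` and `tlb_translate`, and the convention that a miss returns `None` are hypothetical, and hardware would examine all entries in parallel rather than in a loop.

```python
from typing import NamedTuple, Optional, Sequence


class TlbEntry(NamedTuple):
    mask: int  # selects the virtual-address bits that take part in the match
    tag: int   # expected value of (virtual address AND mask)
    data: int  # translation data gated out on a match


def tlb_translate(entries: Sequence[TlbEntry], virtual: int) -> Optional[int]:
    """Return the physical address for `virtual`, or None on a TLB miss."""
    for entry in entries:
        if (virtual & entry.mask) == entry.tag:
            low_bits = virtual & ~entry.mask  # bits not covered by the mask
            return entry.data ^ low_bits      # exclusive-OR combination
    return None  # miss: software would fetch the mapping and update the TLB


# One entry mapping a 4-kilobyte region (illustrative values only).
tlb = [TlbEntry(mask=0xFFFFFFFFFFFFF000, tag=0x0000000000400000, data=0x0000000080000000)]
hit = tlb_translate(tlb, 0x400ABC)
miss = tlb_translate(tlb, 0x500000)
```

A mask with fewer set bits makes a single entry cover a larger region, which matches the idea of assigning one TLB entry to a large memory section.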

Using generally known memory management techniques, the
memory management unit 122 ensures that instructions (and data) are properly
retrieved from external memory (or other sources) over an external input/output
bus 126 (see FIG. 7). As described in more detail below, a high bandwidth
interface 124 is coupled to the external input/output bus 126 to communicate
instructions (and media data streams) to the general purpose media processor 12.
The presently preferred physical address width for the general purpose media
processor 12 is eight bytes (64 bits). In addition, the memory management unit
122 preferably provides match bits (not shown) that allow large memory regions to
be assigned a single TLB entry, allowing for fine grain memory management of
large memory sections. The memory management unit 122 also preferably
includes a priority bit (not shown) that allows for preferential queuing of memory
areas according to respective levels of priority. Other memory management
operations generally known in the art are also performed by the memory
management unit 122.

Referring again to FIG. 7, instructions received by the general
purpose media processor 12 are stored in a combined instruction buffer/cache 118.
The instruction buffer/cache 118 is dynamically subdivided to store the largest
sequence of instructions capable of execution by the execution unit 100 without the
necessity of accessing external memory. In a preferred embodiment of the
invention, instruction buffer space is allocated to the smallest and most frequently
executed blocks of media instructions. The instruction buffer thus helps maintain
the high bandwidth capacity of the general purpose media processor 12 by
sustaining the number of instructions executed per second at or near peak
operation. That portion of the instruction buffer/cache 118 not used as a buffer is,
therefore, available to be used as cache memory. The instruction buffer/cache 118
is coupled to the instruction path 112 and is preferably 32 kilobytes in size.

A data buffer/cache 120 is also provided to store data transmitted
and received to and from the execution unit 100 and register file 110. The data
buffer/cache 120 is also dynamically subdivided in a manner similar to that of the
instruction buffer/cache 118. The buffer portion of the data buffer/cache 120 is
optimized to store a set size of unified media data capable of execution without the
necessity of accessing external memory. In a preferred embodiment of the
invention, data buffer space is allocated to the smallest and most frequently
accessed working sets of media data. Like the instruction buffer, the data buffer
thus maintains peak bandwidth of the general purpose media processor 12. The
data buffer/cache 120 is coupled to the data path 108 and is preferably also 32
kilobytes in size.

The preferred embodiment of the general purpose media processor
12 includes a pipelined instruction pre-fetch structure. Although pipelined
operation is supported, the general purpose media processor 12 also allows for
non-pipelined operations to execute without any operational penalty. One
preferred pipeline structure for the general purpose media processor 12 comprises
a "super-string" pipeline shown in FIG. 11. A super-string pipeline is designed to
fetch and execute several instructions in each clock cycle. The instructions
available for the general purpose media processor 12 can be broken down into five
basic steps of operation. These steps include a register-to-register address
calculation, a memory load, a register-to-register data calculation, a memory store
and a branch operation. According to the super-string pipeline organization of the
general purpose media processor 12, one instruction from each of these five types
may be issued in each clock cycle. The presently preferred ordering of these
operations is as listed above, where each of the five steps is assigned a letter:
"A," "L," "E," "S" and "B" (see FIG. 11).

According to the super-string pipelining technique, each of the
instructions is serially dependent, as shown in FIG. 11, and the general purpose
media processor 12 has the ability to issue a string of dependent instructions in a
single clock cycle. The instructions shown in FIG. 11 can take from two to five
cycles of latency to execute, and a branch prediction mechanism is preferably used
to keep the pipeline filled (described below). Instructions can be encoded in
unit categories such as address, load, store/sync, fixed, float and branch to allow
for easy decoding. A similar scheme is employed to pre-fetch data for the general
purpose media processor 12.

As those skilled in the art will appreciate, the super-string pipeline
can be implemented in a multi-threaded environment. In such an implementation,
the number of threads is preferably relatively prime with respect to functional unit
rates so that functional units can be scheduled in a non-interfering fashion between
each thread.
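
One way to see why a relatively prime thread count helps is the toy model below. It rests on assumptions that are not stated in the source: threads take issue cycles in strict round-robin order, and a functional unit accepts a new operation only every `unit_period` cycles. The function names are hypothetical.

```python
from math import gcd


def threads_reaching_unit(n_threads: int, unit_period: int) -> set:
    """Return the threads that ever find the functional unit free.

    Assumed model: thread (cycle % n_threads) owns each issue cycle, and
    the unit is free only on cycles that are multiples of `unit_period`.
    """
    if n_threads <= 0 or unit_period <= 0:
        raise ValueError("thread count and unit period must be positive")
    horizon = n_threads * unit_period  # the schedule repeats after this many cycles
    return {cycle % n_threads for cycle in range(0, horizon, unit_period)}


def all_threads_served(n_threads: int, unit_period: int) -> bool:
    """True when every thread gets a turn on the functional unit."""
    return len(threads_reaching_unit(n_threads, unit_period)) == n_threads


coprime_case = threads_reaching_unit(5, 2)  # gcd(5, 2) == 1
shared_factor = threads_reaching_unit(4, 2)  # gcd(4, 2) == 2
```

In this model every thread is served exactly when the thread count and the unit period share no common factor; with a shared factor, some threads never line up with the unit.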

In another more preferred embodiment, a "super-spring" pipelining
scheme is employed with the general purpose media processor 12. The
super-spring pipeline technique breaks the super-string pipeline shown in FIG. 11 into
two sections that are coupled via a memory buffer (not shown). A visual
representation of the super-spring pipeline technique is shown in FIG. 12. The
front of the pipeline 204, in which address calculation (A), memory load (L), and
branch (B) operations are handled, is decoupled from the back of the pipeline 206,
in which data calculation (E) and memory store (S) operations are handled. The
decoupling is accomplished through the memory buffer (not shown), which is
preferably organized in a first-in-first-out ("FIFO") fast/dense structure. (The
memory buffer is functionally represented as a spring in FIG. 12.)
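
The decoupling of the front and back of the pipeline through a FIFO can be sketched as a small simulation. It is a behavioral illustration only: the function name, the FIFO depth, the one-result-per-step pacing of the back end and the use of dictionaries for memory are assumptions, and branch handling is omitted.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def run_super_spring(
    addresses: Iterable[int],
    memory: Dict[int, int],
    compute: Callable[[int], int],
    fifo_depth: int = 4,
) -> Dict[int, int]:
    """Run a decoupled load/compute/store loop (illustrative model).

    Front end: address calculation (A) and memory load (L) run ahead and
    push loaded operands into a bounded FIFO.
    Back end: data calculation (E) and memory store (S) drain the FIFO.
    """
    fifo: deque = deque()
    stored: Dict[int, int] = {}
    pending = iter(addresses)
    front_done = False
    while not front_done or fifo:
        # Front of the pipeline: run ahead until the FIFO is full.
        while not front_done and len(fifo) < fifo_depth:
            try:
                addr = next(pending)            # A: next address
            except StopIteration:
                front_done = True
                break
            fifo.append((addr, memory[addr]))   # L: load the operand
        # Back of the pipeline: consume one buffered operand per step.
        if fifo:
            addr, value = fifo.popleft()
            stored[addr] = compute(value)       # E then S
    return stored


results = run_super_spring([0, 1, 2], {0: 10, 1: 20, 2: 30}, lambda v: v * 2)
```

Because the front end only stalls when the FIFO is full, load latency can be hidden behind the back end's computation, which is the elasticity the spring in FIG. 12 represents.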

As indicated in Table I above, the general purpose media processor
12 does not include delayed branch instructions, and so relies upon branch or fetch
prediction techniques to keep the pipeline full in program flows around
unconditional and conditional branch instructions. Many such techniques are
generally known in the art. Examples of some presently preferred techniques
include the use of group compare and set, and multiplex operations to eliminate
unpredictable branches; the use of short forward branches, which cause pipeline
neutralization; and where branch and link predicts the return address in a one or
more entry stack. In addition, the specialized gateway instructions included in the
general purpose media processor 12 allow for branches to and from protected
virtual memory space. The gateway instructions, therefore, allow an efficient
means to transfer between various levels of privilege.

As described above, two basic forms of media data are processed by
the general purpose media processor 12, as shown in FIG. 7. These data streams
generally comprise Nyquist sampled I/O 128, and standard memory and I/O 130.
As shown in FIG. 7, audio 132, video 134, radio 136, network 138, tape 140 and
disc 142 data streams comprise some examples of digitally sampled I/O 128. As
those skilled in the art will appreciate, other forms of digitally sampled I/O are
contemplated for processing by the general purpose media processor 12 without
departing from the spirit and scope of the invention. Standard memory and I/O
130 comprises data received and transmitted to and from general digital peripheral
devices used in the design of most computer systems. As shown in FIG. 7, some
examples of such devices include dynamic random access memory ("DRAM")
146, or any data received over the PCI bus 144 generally known in the art. Other
forms of standard memory and I/O sources are also contemplated. The various
fixed-point data sizes preferred for the general purpose media processor 12 are
illustrated in FIG. 9(c).



External Interface

As mentioned above, the general purpose media processor 12
includes a high bandwidth interface 124 to communicate with external memory and
input/output sources. As part of the high bandwidth interface 124, the general
purpose media processor 12 integrates several fast communication channels 156
(FIG. 13) to communicate externally. These fast communication channels 156
preferably couple to external caches 150, which serve as a buffer to memory
interfaces 152 coupled to standard memory 154. The caches 150 preferably
comprise synchronous static random access memory ("SRAM"), each of which is
sixty-four kilobytes in size; and the standard memories 154 comprise DRAM's.
The memory interfaces 152 transmit data between the caches 150 and the standard
memories 154. The standard memories 154 together form the main external
memory for the general purpose media processor 12. The cache 150, memory
interface 152, standard memory 154 and input/output channel 156 therefore make
up a single external memory unit 158 for the general purpose media processor 12.

According to the presently preferred embodiment of the invention,
the memory interface protocol embeds read and write operations to a single
memory space into packets containing command, address, data and
acknowledgment information. The packets preferably include check codes that
will detect single-bit transmission errors and some multiple-bit errors. As many as
eight operations may be in progress at a time in each external memory unit 158.
As shown in FIG. 13, up to four external memory units 158 may be cascaded
together to expand the memory available to the general purpose media processor
12, and to improve the bandwidth of the external memory. Through such
cascaded memory units 158, the memory interface 152 provides for the direct
connection of multiple banks of standard memory 154 to maintain operation of the
general purpose media processor 12 at sustained peak bandwidths.

According to one embodiment shown in FIG. 13, up to four
standard memory devices 154 can be coupled to each memory interface 152. Each
standard memory 154 thus includes as many as four banks of DRAM, each of
which is preferably sixteen bits wide. The standard memories 154 are connected
in parallel to the memory interface 152 forming a 72-bit wide data bus 160, where
64 bits are preferably provided for data transfer and eight bits are provided for
error correction. In addition to the data bus 160, an address/control bus 162 is
coupled between the memory interface 152 and each standard memory 154. The
address/control bus 162 preferably comprises at least twelve address lines (4
kilobits X 16 memory size) and four control lines as shown in FIG. 13. An
alternate manner for coupling the DRAMs to the memory interface 152 is
illustrated in FIG. 14. As shown in FIG. 14, two banks of four DRAM single
in-line memory modules are coupled in parallel to the memory interface 152. The
memory interface 152 also supports interleaving to enhance bandwidth, and page
mode accesses to improve latency for localized addressing.

Using standard DRAM components, the external memory units 158
achieve bandwidths of approximately two gigabits/second with the standard
memories 154. When four such external memory units 158 are coupled via the
communication channel 156, therefore, the total bandwidth of the external main
memory system increases to one gigabyte/second. As discussed further below, in
implementations with two or eight communication channels 156, the aggregate
bandwidth increases to two and eight gigabytes/second, respectively.
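
By way of illustration only, the bandwidth figures in the preceding paragraph follow from simple unit arithmetic, which the Python sketch below reproduces. The constant and function names are hypothetical and are not part of the disclosure; the per-unit figure is the approximate value quoted above.

```python
# Approximate per-unit bandwidth quoted in the text (standard DRAM components).
UNIT_BANDWIDTH_GBIT_PER_S = 2
UNITS_PER_CHANNEL = 4   # up to four cascaded external memory units 158
BITS_PER_BYTE = 8


def channel_bandwidth_gbyte_per_s(units=UNITS_PER_CHANNEL):
    """Aggregate bandwidth of one communication channel, in gigabytes/second."""
    return units * UNIT_BANDWIDTH_GBIT_PER_S / BITS_PER_BYTE


def system_bandwidth_gbyte_per_s(channels):
    """Aggregate bandwidth for a given number of communication channels."""
    return channels * channel_bandwidth_gbyte_per_s()


if __name__ == "__main__":
    for n in (1, 2, 8):
        print(n, "channel(s):", system_bandwidth_gbyte_per_s(n), "gigabytes/second")
```

Four units at roughly two gigabits/second each give eight gigabits/second, or one gigabyte/second per channel, so two and eight channels scale to two and eight gigabytes/second.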

A more detailed depiction of the communication channel 156
circuitry appears in FIG. 15. According to the preferred embodiment of the
invention, each communication channel 156 comprises two unidirectional,
byte-wide, differential, packet-oriented data channels 156a, 156b (see FIG. 13). As
explained above, where memory units 158 are cascaded together in series, the
output of one memory unit 158 is connected to the input of another memory unit
158. The two unidirectional channels are thus connected through the memory
units 158 forming a loop structure and make up a single bi-directional memory
interface channel.

Referring to FIG. 15, each communication channel 156 is preferably
eight bits wide, and each bit is transmitted differentially. For example, output
transceiver 170 for bit D0 transmits both D0 and /D0 signals over the
communication channel 156. Additional transceivers are similarly provided for the
remaining bits in the channel 156. (The transceiver 176 for another bit and its
associated differential lines 178, 180 are shown in FIG. 15.) A CLK transceiver
182 is also provided to generate differential clock outputs 184, 186 over the
channel 156. To complete the link between memory units 158, input transceivers
188-192 are provided in each memory unit 158 for each of the differential bits and
clock signals transmitted over the communication channel 156. These input
signals 172, 174, 178, 180, 184, 186 are preferably transmitted through input
buffers 194-198 to other parts of the memory unit 158 (described above).

Each memory unit 158 also includes a skew calibrator 200 and
phase locked loop ("PLL") 202. The skew calibrator 200 is used to control skew
in signals output to the communication channel 156. Preferably, digital skew
fields are employed, which include set numbers of delay stages to be inserted in
the output path of the communication channel 156. Setting these fields, and the
corresponding analog skew fields, permits a fine level of control over the relative
skew between output channel signals.

The PLL 202 recovers the clock signal on either side of the
communication channel 156 and is thus provided to remove clock jitter. The clock
signals 184, 186 preferably comprise a single phase, constant rate clock signal.
The clock signals 184, 186 thus contain alternating zero and one values transmitted
with the same timing as the data signals 172, 174, 178, 180. The clock signal
frequency is, therefore, one-half the byte data rate. The communication channel
156 preferably operates at constant frequency and contains no auxiliary control,
handshaking or flow control information.

Each external memory unit 158 preferably defines two functional
regions: a memory region, implemented by the cache 150 backed by standard
memory 154 (see FIG. 13), and a configuration region, implemented by registers
(not shown). The two regions are accessed by separate interfaces; the communication
channel 156 is used to access the memory region, and a serial interface (described
below) is used to access the configuration region. In the memory region, the
caches 150 are preferably write-back (write-in) single-set (direct-map) caches for
data originally contained in standard memory 154. All accesses to memory space
should maintain consistency between the contents of the cache 150 and the
contents of the standard memory 154. The configuration region registers provide
the mechanism to detect and adjust skew in the communication channel 156.
Software is preferably employed to adaptively adjust the skew in the channel 156
through digital skew fields, as explained above. The serial interface thus is used
to configure the external memory units 158, set diagnostic modes and read
diagnostic information, and to enable the use of a high-speed tester (not shown).

One presently preferred embodiment of the invention employs two
byte-wide packet communication channels 156 (FIG. 16(a)). In order to further
increase the bandwidth of the general purpose media processor 12, up to sixteen
byte-wide packet communication channels 156 can be employed. Referring to
FIG. 16(b), twelve communication channels, comprising eight memory channels
210, a ninth channel for parallel processing 212 (described below), and three
input/output ("I/O") channels 214, are shown. Each of the communication
channels 210-214 preferably employs the cascade configuration of four channel
interface devices 216. (Each channel interface device 216 coupled to the memory
channels 210 corresponds to the external memory unit 158 shown in FIG. 13.)
Through each of the twelve communication channels shown in FIG. 16(b), the
general purpose media processor 12 can request or issue read or write
transactions. When not interleaved, the twelve channels provide a single
contiguous memory space for each channel interface device 216.

Alternatively, memory accesses may be interleaved in order to
provide for continuous access to the external memory system at the maximum
bandwidth for the DRAM memories. In an interleaved configuration, at any point
in time some memory devices will be engaged in row pre-charge, while others
may be driving or receiving data, or receiving row or column addresses. The
memory interface 152 (FIG. 13) thus preferably maps between a contiguous
address space and each of the separate address spaces made available within each
external memory unit 158. For maximum performance, therefore, the memory
interface is interleaved so that references to adjacent addresses are handled by
different memory devices. Moreover, in the preferred embodiment, additional
memory operations may be requested before the corresponding DRAM bank is
available. In an interleaved approach, these operations are placed in a queue until
they can be processed. According to the preferred embodiment, memory writes
have lower priority than memory reads, unless an attempt is made to read an
address that is queued for a write operation. As those skilled in the art will
appreciate, the depth of the memory write queue is dictated by the specific
implementation.
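
The interleaving and queueing behavior described above can be sketched in Python as follows. This is an illustrative model only, not the disclosed circuitry. The names `interleave` and `BankQueue`, the eight-byte interleave granularity, and the choice to issue the oldest queued writes first when a read conflicts with a queued write are all assumptions made for the example.

```python
from collections import deque

WORD_BYTES = 8  # assumed interleave granularity: one 64-bit data word


def interleave(address, num_devices):
    """Map a contiguous byte address to (device index, address within device)."""
    word, offset = divmod(address, WORD_BYTES)
    local_word, device = divmod(word, num_devices)
    return device, local_word * WORD_BYTES + offset


class BankQueue:
    """Pending operations for one DRAM bank; reads normally bypass writes."""

    def __init__(self):
        self.reads = deque()
        self.writes = deque()

    def request_read(self, address):
        self.reads.append(address)

    def request_write(self, address, data):
        self.writes.append((address, data))

    def next_operation(self):
        """Return the next operation to issue, or None if nothing is pending."""
        if self.reads:
            target = self.reads[0]
            # A read of an address with a queued write must wait for that
            # write, so queued writes are issued (oldest first) until the
            # conflict is gone. This draining policy is an assumption.
            if any(addr == target for addr, _ in self.writes):
                return ("write",) + self.writes.popleft()
            return ("read", self.reads.popleft())
        if self.writes:
            return ("write",) + self.writes.popleft()
        return None
```

With four devices, consecutive words land on devices 0, 1, 2 and 3 in turn, so a sequential access stream keeps every device busy while the others pre-charge.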

Although up to four external memory units 158 are preferably
cascaded to form effectively larger memories, some amount of latency may be
introduced by the cascade. Packets of data transmitted over the communication
channel 156 are uniquely addressed to a particular channel interface device 216.
A packet received at a particular device, which specifies another module address,
is automatically passed to the correct channel interface device 216. Unless the
module address matches a particular device 216, that packet simply passes from
the input to the output of the interface device 216. This mechanism divides the
serial interconnection of interface devices 216 into strings, which function as a
single larger memory or peripheral, but with possibly longer response latency.
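
As a minimal sketch of the pass-through behavior just described, the Python model below walks a packet down a cascade until a device with a matching module address accepts it. The `Device` class, the dictionary packet representation with an `"ma"` key, and the `send` helper are hypothetical names introduced for illustration; the number of devices that forward the packet stands in for the added response latency.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    module_address: int                      # two-bit module address, 0..3
    accepted: list = field(default_factory=list)

    def receive(self, packet):
        """Accept the packet if addressed here; otherwise return it for forwarding."""
        if packet["ma"] == self.module_address:
            self.accepted.append(packet)
            return None
        return packet  # passes unchanged from input to output


def send(cascade, packet):
    """Push a packet through the cascade; return how many devices forwarded it."""
    forwarded = 0
    for device in cascade:
        if device.receive(packet) is None:
            return forwarded
        forwarded += 1
    return forwarded  # no device matched; the packet leaves the string
```

A packet addressed to the third device in a string of four is forwarded twice before it is accepted, which is the longer latency the text refers to.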

In addition to the memory channels 210, the general purpose media
processor 12 provides several communication channels 214 for communication
with external input/output devices. Referring to FIG. 16(b), three input/output
channels 214 having SRAM buffered memory (see FIG. 13) provide an interface
to external standard I/O devices (not shown). Like the eight memory channels
210, the three I/O channels 214 are byte-wide input/output channels intended to
operate at rates of at least one gigahertz. The three I/O channels 214 also operate
as a packet communication link to synchronous SRAM memory 208 within the
channel interface device 216. A controller 226 within the channel interface device
216 completes the interface to the I/O devices.

The three I/O channels 214 preferably function in like manner to the
memory channels 210 described above. The interface protocol for the three I/O
channels 214 divides read and write operations to a single memory space into
packets containing command, address, data and acknowledgment information. The
packets also include a check code that will detect single-bit transmission errors and
some multiple-bit errors. According to the preferred embodiment of the invention,
as many as eight operations may progress in each interface device 216 at a time.
As shown in FIG. 16(b), up to four channel interface devices 216 can be cascaded
together to expand the bandwidth in the three I/O channels 214. A bit-serial
interface (not shown) is also provided to each of the channel interface devices 216
to allow access to configuration, diagnostic and tester information at standard TTL
signal levels at a more moderate data rate. (A more detailed description of the
serial interface is provided below.)

Like the memory channels 210, each I/O channel 214 includes nine
signals - one clock signal and eight data signals. Differential voltage levels are
preferably employed for each signal. Each channel interface device 216 is
preferably terminated in a nominal 50 ohm impedance to ground. This impedance
applies for both inputs and outputs to the communication channel 156. A
programmable termination impedance is preferred.

Interface Communication

According to one presently preferred embodiment of the invention,
the channel interface devices 216 can operate as either master devices or slave
devices. A master device is capable of generating a request on the communication
channel 156 and receiving responses from the communication channel 156. Slave
devices are capable of receiving requests and generating responses over the
communication channel 156. A master device is preferably capable of generating
a constant frequency clock signal and accepting signals at the same clock
frequency over the communication channel 156. A slave device, therefore, should
operate at the same clock rate as the communication channel 156, and generate no
more than a specified amount of variation in output clock phase relative to input
clock phase. The master device, however, can accept an arbitrary input clock
phase and tolerates a specified amount of variation in clock phase over operating
conditions.

Packets of information sent over the communication channel 156
preferably contain control commands, such as read or write operations, along with
addresses and associated data. Other commands are provided to indicate error
conditions and responses to the above commands. When the communication
channel 156 is idle, such as during initialization and between transmitted packets,
an idle packet, consisting of an all-zero byte and an all-one byte, is transmitted
through the communication channel 156. Each non-idle packet consists of two
bytes or a multiple of two bytes, and begins with a byte having a value other than
all zeros. All packets transmitted over the communication channel 156 also begin
during a clock period in which the clock signal is zero, and all packets preferably
end during a clock period in which the clock signal is one. A depiction of the
preferred packet protocol format for transmission over the communication channel
156 appears in FIG. 17.

The general form of each packet is an array of bytes preferably
without a specific byte ordering. The first byte contains a module address 250
("ma") in the high order two bits; a packet identifier, usually a command 252
("com"), in the next three bit positions; and a link identification number 254
("lid") in the last three bit positions. The interpretation of the remaining bytes of
a packet depends upon the contents of the packet identifier. The length of each
packet is preferably implied by the command specified in the initial byte of the
packet. A check byte is provided and computed as odd bit-wise parity with a
leftward circular rotation after accumulating each byte. This technique provides
detection of all single-bit and some multiple-bit errors, but no correction is
provided.
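
The first-byte layout and check byte described above can be sketched in Python as follows. The field positions are taken from the text. The check-byte routine is only one plausible reading of "odd bit-wise parity with a leftward circular rotation after accumulating each byte": the zero starting value, the one-bit rotation and the final inversion are assumptions, and all function names are hypothetical.

```python
def pack_header(ma, com, lid):
    """Build the first packet byte: ma in bits 7-6, com in bits 5-3, lid in bits 2-0."""
    if not (0 <= ma < 4 and 0 <= com < 8 and 0 <= lid < 8):
        raise ValueError("field out of range")
    return (ma << 6) | (com << 3) | lid


def unpack_header(byte):
    """Split the first packet byte into (ma, com, lid)."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def rotate_left(byte):
    """Rotate an 8-bit value left by one position."""
    return ((byte << 1) | (byte >> 7)) & 0xFF


def check_byte(payload):
    """Assumed check code: XOR each byte in, rotate left, then invert at the end."""
    acc = 0  # assumed starting value
    for b in payload:
        acc = rotate_left(acc ^ b)
    return acc ^ 0xFF  # inversion yields odd rather than even parity
```

Because XOR and rotation are both reversible, flipping any single bit of the payload always changes the resulting check byte under this reading, which matches the stated detection of all single-bit errors; no correction is attempted.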

The module address 250 field of each packet is preferably a two-bit
field and allows for as many as four slave devices to be operated from a single
communication channel 156. Module address values can be assigned in one of
two fashions: either dynamically assigned through a configuration register (not
shown), or assigned via static/geometric configuration pins. Dynamic assignment
through a configuration register is the presently preferred method for assigning
module address values.

The link identification number 254 field is preferably 3-bits wide
and provides the opportunity for master devices to initiate as many as eight
independent operations at any one time to each slave device. Each outstanding
operation requires a distinct link identification number, but no ordering of
operations should be implied by the value of the link identification field. Thus,
there is preferably no requirement for link identification values 254 to be
sequentially assigned either in requests or responses.
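
A master's bookkeeping for the link identification field might look like the following Python sketch. The `LinkIdPool` class and its lowest-free-value policy are hypothetical; the text requires only that outstanding operations carry distinct values and that no ordering be inferred from them.

```python
class LinkIdPool:
    """Tracks the link identification numbers a master has outstanding to one slave."""

    MAX_OUTSTANDING = 8  # a 3-bit field gives eight distinct values

    def __init__(self):
        self.in_flight = {}

    def issue(self, operation):
        """Assign any free link id to the operation; return None if all eight are in use."""
        for lid in range(self.MAX_OUTSTANDING):
            if lid not in self.in_flight:
                self.in_flight[lid] = operation
                return lid
        return None

    def complete(self, lid):
        """Match a response to its request by link id, in whatever order it arrives."""
        return self.in_flight.pop(lid)
```

Responses can be retired in any order, and a freed value is immediately reusable for a new request.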

The receipt of packets over the communication channel 156 that do
not conform to the channel protocol preferably generates an error condition. As
those skilled in the art will appreciate, the level or degree to which a specific
implementation detects errors is defined by the user. In one presently preferred
embodiment of the invention, all errors are detected, and the following protocol is
employed for handling errors. For each error detected, the channel interface
device 216 causes a response explicitly indicating the error condition. Channel
interface devices 216 reporting an invalid packet will then suppress the receipt of
additional packets until the error is cleared. The transmitted packet is otherwise
ignored. However, even though the erroneous packet is ignored, the channel
interface devices 216 preferably continue to process valid packets that have already
been received and generate responses thereto. The presently preferred
commands 252 to be used over the communication channel 156 are listed
in FIG. 17.

In the master/slave preferred embodiment, the channel interface
devices 216 forward packets that are intended for other devices connected to the
communication channel 156, as described above. In slave devices, forwarding is
performed based on the module address 250 field of the packet. Packets which
contain a module address 250 other than that of the current device are forwarded
on to the next device. All non-idle packets are thus forwarded including error
packets. In master devices, forwarding is performed based on the link identifier
number 254 of the packet. Packets that contain link identifier numbers 254 not
generated by the specific channel interface device 216 are forwarded. In order to
reduce transmission latency, a packet buffer may be provided. As those skilled in
the art appreciate, the suitable size for the packet buffer depends on the amount of
latency tolerable in a particular implementation.

A variety of master/slave ring configurations are possible using the
high bandwidth interface 124 of the invention. Five ring configurations are
currently preferred: single-master, dual-master, multiple-master, single-slave and
multiple-master/multiple-slave. The simplest ring configuration contains a single
non-forwarding master device and a single non-forwarding slave device. No
forwarding is required for either device in this configuration as packets are sent
directly to the recipient. A single-master ring, however, may contain a cascade of
up to four slave devices (see FIGS. 13, 16). In the single-master ring
configuration, each slave device is configured to a distinct module address, and
each slave device forwards packets that contain module address fields unequal to
their own. As discussed above, a single-master ring provides a larger memory or
I/O capacity than a master-slave pair, but also introduces a potentially longer
response latency. In the single-master ring, each slave device may have as many
as eight transactions outstanding at any time, as described above.

The remaining combinations share many of the above basic
attributes. In a dual-master pair, each master device may initiate read and write
operations addressed to the other, and each may have up to eight such transactions
outstanding. No forwarding is required for either device because packets are sent
directly to the recipient. A multiple-master ring may contain multiple master
devices and a single slave device. In this configuration, the slave device need not
forward packets as all input packets are designated for the single slave device. A
multiple-master ring may contain multiple master devices and as many as four
slave devices. Each slave device may have up to eight transactions outstanding,
and each master device may use some of those transactions. In a preferred
embodiment, a master also has the capability to detect a time-out condition or
when a response to a request packet is not received. Further aspects of
inter-processor communications and configurations are discussed below in connection
with FIG. 18.



In one preferred embodiment of the invention, the general purpose
media processor 12 includes a serial bus (not shown). The serial bus is designed
to provide bootstrap resources, configuration, and diagnostic support to the general
purpose media processor 12. The serial bus preferably employs two signals, both
at TTL levels, for direct communication among many devices. In the preferred
embodiment, the first signal is a continuously running clock, and the second signal
is an open-collector bi-directional data signal. Four additional signals provide
geographic addresses for each device coupled to the serial bus. A gateway
protocol, and optional configurable addressing, each provide a means to extend the
serial bus to other buses and devices. Although the serial bus is designed for
implementation in a system having a general purpose media processor 12, as those
skilled in the art will appreciate, the serial bus is applicable to other systems as
well.

Because the serial bus is preferably used for the initial bootstrap
program load of the general purpose media processor 12, the bootstrap ROM is
coupled to the serial bus. As a result, the serial bus needs to be operational for
the first instruction fetch. The serial bus protocol is therefore devised so that no
transactions are required for initial bus configuration or bus address assignment.

According to the preferred embodiment, the clock signal comprises
a continuously running clock signal at a minimum of 20 megahertz. The amount
of skew, if any, in the clock signal between any two serial bus devices should be
limited to be less than the skew on the data signal. Preferably, the serial data
signal is a non-inverted open collector bi-directional data signal. TTL levels are
preferred for communication on the serial bus, and several termination networks
may be employed for the serial data signal. A simple preferred termination
network employs a resistive pull-up of 220 ohms to 3.3 volts above Vss. An
alternate embodiment employs a more complex termination network such as a
termination network including diodes or the "Forced Perfect Termination" network
proposed for the SCSI-2 standard, which may be advantageous for larger
configurations.

The geographic addressing employed in the serial bus is provided to
ensure that each device is addressable with a number that is unique among all
devices on the bus and which also preferably reflects the physical location of the
device. Thus, the address of each device remains the same each time the system
is operated. In one preferred embodiment, the geographic address is composed of
four bits, thus allowing for up to 16 devices. In order to extend the geographic
addressing to more than 16 devices, additional signals may be employed such as a
buffered copy of the clock signal or an inverted copy of the clock signal (or both).

The serial bus preferably incorporates both a bit level and packet
protocol. The bit level protocol allows any device to transmit one bit of
information on the bus, which is received by all devices on the bus at the same
time. Each transmitted bit begins at the rising edge of the clock signal and ends at
the next rising edge. The transmitted bit value is sampled at the next rising edge
of the clock signal. According to one preferred embodiment where the serial data
signal is an open collector signal, the transmission of a zero bit value on the bus is
achieved by driving the serial data signal to a logical low value. In this
embodiment, the transmission of a one bit value is achieved by releasing the serial
data signal to obtain a logical high value. If more than one device attempts to
transmit a value on the same clock, the resulting value is a zero if any device
transmits a zero value, and one if all devices transmit a one value. This provides
a "wired-AND" collision mechanism, as those skilled in the art will appreciate. If
two or more devices transmit the same value on the same clock cycle, however,
no device can detect the occurrence of a collision. In such cases, the transaction,
which may occur frequently in some implementations, preferably proceeds as
described below.
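
The wired-AND behavior of the open collector data signal can be modeled in a few lines of Python. This is an illustrative sketch only; `bus_value` and `sees_collision` are hypothetical names, and the model assumes the pull-up leaves the line high when no device drives it.

```python
def bus_value(transmitted_bits):
    """Resolve one clock cycle on the open collector serial data line.

    Each entry is the bit one device transmits: 0 drives the line low,
    1 releases it. With no transmitters the pull-up leaves the line high.
    """
    return int(all(transmitted_bits))


def sees_collision(sent_bit, observed_bit):
    """A device notices a collision only when it released the line but reads zero."""
    return sent_bit == 1 and observed_bit == 0
```

When every transmitter sends the same value, the observed bit equals what each device sent, so none of them can tell that another device was transmitting.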

The packet protocol employed with the serial bus uses the bit level
protocol to transmit information in units of eight bits or multiples of eight bits.
Each packet transmission preferably begins with a start bit in which the serial data
signal has a zero (driven) value. After transmitting the eight data bits, a parity bit
is transmitted. The transmission continues with additional data. A single one
(released) bit is transmitted immediately following the least significant bit of each
byte signaling the end of the byte.

On the cycle following the transmission of the parity bit, any device
may demand a delay of two cycles to process the data received. The two cycle
delay is initiated by driving the serial data signal (to a zero value) and releasing
the serial data signal on the next cycle. Before releasing the serial data signal,
however, it is preferable to ensure that the signal is not being driven by any other
device. Further delays are available by repeating this pattern.

In order to avoid collisions, a device is not permitted to start a
transmission over the serial bus unless there are no currently executing
transactions. To resolve collisions that may occur if two devices begin
transmission on the same cycle, each transmitting device should preferably monitor
the bus during the transmission of one (released) bits. If any of the bits of the
byte are received as zero when transmitting a one, the device has lost arbitration
and must cease transmission of any additional bits of the current byte or
transaction.
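
The arbitration rule above can be illustrated with a short Python simulation in which several devices begin transmitting on the same cycle. The `arbitrate` function and its dictionary input are hypothetical; the sketch assumes a non-empty set of equal-length messages and models only the data bits, not the start, parity or end-of-byte bits.

```python
def arbitrate(messages):
    """Simulate devices that start transmitting on the same cycle.

    `messages` maps a device name to its list of bits (all the same length).
    Returns (bits seen on the bus, names of devices still transmitting at the end).
    """
    active = set(messages)
    length = len(next(iter(messages.values())))
    bus = []
    for cycle in range(length):
        # Open collector line: low if any active device drives a zero.
        value = int(all(messages[name][cycle] for name in active))
        bus.append(value)
        # A device that released the line (sent 1) but reads 0 has lost
        # arbitration and stops transmitting.
        active = {name for name in active if messages[name][cycle] == value}
    return bus, active
```

The surviving device's message appears on the bus intact, so arbitration costs no retransmission for the winner. Devices that send identical bit sequences all remain active, consistent with the observation that a collision between equal values goes undetected.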

According to the preferred embodiment of the invention, a serial
bus transaction consists of the transmission of a series of packets. The transaction
begins with a transmission by the transaction initiator, which specifies the target
network, device, length, type and payload of the transaction request. The
transaction terminates with a packet having a type field in a specified range. As a
result, all devices connected to the serial bus should monitor the serial data signal
to determine when transactions begin and end. A serial bus network may have
multiple simultaneous transactions occurring, however, so long as the target and
initiator network addresses are all disjoint.

Parallel Processing

In one preferred embodiment of the invention, two or more general
purpose media processors 12 can be linked together to achieve a multiple
processor system. According to this embodiment, general purpose media
processors 12 are linked together using their high bandwidth interface channels
124, either directly or through external switching components (not shown). The
dual-master pair configuration described above can thus be extended for use in
multiple-master ring configurations. Preferably, internal daemons provide for the
generation of memory references to remote processors, accesses to local physical
memory space, and the transport of remote references to other remote processors.
In a multi-processor environment, all general purpose media processors 12 run off
of a common clock frequency, as required by the communication channels 156 that
connect between processors.

Referring to FIG. 18, each general purpose media processor 12
preferably includes at least a pair of inter-processor links 218 (see also FIG.
16(b)). In one configuration, both pairs of inter-processor links 218 can be
connected between the two processors 12 to further enhance bandwidth. As shown
in FIG. 18(a), several processors 12 may be interconnected in a linear network
employing the transponder daemons in each processor. In an alternate
embodiment shown in FIG. 18(b), the inter-processor links 218 may be used to
join the general purpose media processors 12 in a ring configuration.
Alternatively still, general purpose media processors 12 may be interconnected
into a two-dimensional network of processors of arbitrary size, as shown in FIG.
18(c). Sixteen processors are connected in FIG. 18(c) by connecting four ring
networks. In yet another alternate embodiment, by connecting the inter-processor
links 218 to external switching devices (not shown), multi-processors with a large
number of processors can be constructed with an arbitrary interconnection
topology.

The requester, responder and transponder daemons preferably
handle all inter-processor operations. When one general purpose media processor
12 attempts a load or store to a physical address of a remote processor, the
requester daemon autonomously attempts to satisfy the remote memory reference
by communicating with the external device. The external device may comprise
another processor 12 or a switching device (not shown) that eventually reaches
another processor 12. Preferably, two requester daemons are provided in each
processor 12, which act concurrently on two different byte channels and/or module
addresses. The responder daemon accepts writes from a specified channel and
module address, which enables an external device to generate transaction requests
in local memory or to generate processor events. The responder daemon also
generates link level writes to the same external device that communicated
responses for the received transaction request. Two such responder daemons are
preferably provided, each of which operates concurrently to two different byte
channels and/or module addresses.

The transponder daemon accepts writes from a specified channel and
module address, which enable an external device to cause a requester daemon to
generate a request on another channel and module address. Preferably, two such
transponder daemons are provided, each of which acts concurrently (back-to-back)
between two different byte channels and/or module addresses. As those skilled in
the art will appreciate, the requester, responder and transponder daemons must act
cooperatively to avoid deadlock that may arise due to an imbalance of requests in
the system. Deadlocks prevent responses from being routed to their destinations,
which may defeat the benefits of a multi-processor distributed system.

According to one presently preferred embodiment of the invention,
the general purpose media processor 12 can be implemented as one or more
integrated circuit chips. Referring to FIG. 19, the presently preferred embodiment
of the general purpose media processor 12 consists of a four-chip set. In the
four-chip set, a general purpose media processor 12 is manufactured as a stand alone
integrated circuit. The stand alone integrated circuit includes a memory
management unit 122, instruction and data cache/buffers 118, 120, and an
execution unit 100. A plurality of signal input/output pads 260 are provided
around the circumference of the integrated circuit to communicate signals to and
from the general purpose media processor 12 in a manner generally known in the
art.

The second and third chips of the four-chip set comprise an
external memory element 158 and a channel interface device 216. The external
memory element 158 includes an interface to the communication channel 156, a
cache 150 and a memory interface 152. The channel interface device 216 also
includes an interface to the communication channel 156, as well as buffer memory
262, and input/output interfaces 264. Both the external memory element 158 and
the channel interface device 216 include a plurality of input/output signal pads 260
to communicate signals to and from these devices in a generally known manner.

The fourth integrated circuit chip comprises a switch 226, which
allows for installation of the general purpose media processor 12 in the
heterogeneous network 38. In addition to the plurality of input/output pads 260,
the switch 226 includes an interface to the communication channel 156. The
switch 226 also preferably includes a buffer 262, a router 266, and a switch
interface 268.

As those skilled in the art will appreciate, many implementations for
the general purpose media processor 12 are possible in addition to the four-chip
implementation described above. Rather than an integrated approach, the general
purpose media processor can be implemented in a discrete manner. Alternatively,
the general purpose media processor 12 can be implemented in a single integrated
circuit, or in an implementation with fewer than four integrated circuit chips.
Other combinations and permutations of these implementations are contemplated.

There has been described a system for processing streams of media
data at substantially peak rates to allow for real time communication over a large
heterogeneous network. The system includes a media processor at its core that is
capable of processing such media data streams. The heterogeneous network
consists of, for example, the fiber optic/coaxial cable/twisted wire network in
place throughout the U.S. To provide for such communication of media data, a
media processor according to the invention is disposed at various locations
throughout the heterogeneous network. The media processor would thus function
both in a server capacity and at an end user site within the network. Examples of
such end user sites include televisions, set-top converter boxes, facsimile
machines, wireless and cellular telephones, as well as large and small business and
industrial applications.

To achieve such high rates of data throughput, the media processor
includes an execution unit, high bandwidth interface, memory management unit,
and pipelined instruction and data paths. The high bandwidth interface includes a
mechanism for transmitting media data streams to and from the media processor at
rates at or above the gigahertz frequency range. The media data stream can
consist of transmission, presentation and storage type data transmitted alone or in a
unified manner. Examples of such data types include audio, video, radio, network
and digital communications. According to the invention, the media processor is
dynamically partitionable to process any combination or permutation of these data
types in any size.

A programmable, general purpose media processor system presents
significant advantages over current multimedia communications. Rather than
rigid, costly and inefficient specialized processors, the media processor provides a
general purpose instruction set to ease programmability in a single device that is
capable of performing all of the operations of the specialized processor
combination. Providing a uniform instruction set for all media related operations
eliminates the need for a programmer to learn several different instruction sets,
each for a different specialized processor. The complexity of programming the
specialized processors to work together and communicate with one another is also
greatly reduced. The unified instruction set is also more efficient. Highly
specialized general calculation instructions that are tailored to general or special
types of calculations rather than enhancing communication are eliminated.

Moreover, the media processor system can be easily reprogrammed
simply by transmitting or downloading new software over the network. In the
specialized processor approach, new programming usually requires the delivery
and installation of new hardware. Reprogramming the media processor can be
done electronically, which of course is quicker and less costly than the
replacement of hardware.

It is to be understood that a wide range of changes and
modifications to the embodiments described above will be apparent to those skilled
in the art and are contemplated. It is therefore intended that the foregoing detailed
description be regarded as illustrative rather than limiting, and that it be
understood that it is the following claims, including all equivalents, that are
intended to define the spirit and scope of this invention.

Set forth on the following pages (46-406) is a more detailed
discussion of the operations, and the functionality of the general purpose media
processor 12.

Introduction

MicroUnity's Terpsichore System Architecture describes general-purpose
processor, memory, and interface subsystems, organized to operate at enormously
higher bandwidth rates than traditional computers.

Terpsichore's Euterpe processor performs integer, floating point, and signal
processing operations at data rates up to 512 bits (i.e., up to four 128-bit operand
groups) per instruction. The instruction set design carries the concept of
streamlining beyond Reduced Instruction Set Computer (RISC) architecture,
since it targets implementations that issue several instructions per machine cycle.

The Terpsichore memory subsystem provides 64-bit virtual and physical
addressing for UNIX, Mach, and other advanced OS environments. Caches
supply the high data and instruction issue rates of the processor, and support
coherency primitives for scalable multiprocessors. The memory subsystem
includes mechanisms for sustaining high data rates not only in block transfer
modes, but also in non-unit stride and scatter/gather access patterns.

Hermes channels provide 64-bit transfers between subsystem components with
gigabyte-per-second bandwidth. Terpsichore's Cerberus serial bus provides a
flexible, robust and inexpensive mechanism to handle system initialization,
configuration, availability, and error recovery. Mnemosyne memory interface
devices provide for the integration of large numbers of industry-standard memory
components into Terpsichore systems. Persephone devices enable Terpsichore
systems to utilize industry-standard PCI interface cards.

Terpsichore's Calliope interface subsystem is tightly integrated with the processor
and memory, to supply both the bandwidth and real-time response needs of video,
audio, network and mass storage interfaces. Integration provides for the sharing
of memory bandwidth among these devices and the processor, without distributed
or dedicated buffer memories in each interface adapter.

Terpsichore's Euterpe processor incorporates Icarus interprocessor interfaces for 
assembly of small-scale, coherently-cached, shared-memory multiprocessors 
without additional circuitry. Icarus interfaces may also be used to connect 
Terpsichore processors to a high-performance switching fabric for large-scale 
multiprocessors, or to adapters to standard interprocessor interfaces, such as 
Scalable Coherent Interface (IEEE standard 1596-1992). 

The goal of the Terpsichore architecture is to integrate these processor, memory, 
and interface capabilities with optimal simplicity and generality. From the 
software perspective, the entire machine state consists of a program counter, a 
single bank of 64 general-purpose 64-bit registers, and a linear byte-addressed 
shared memory space with mapped interface registers. All interrupts and 
exceptions are precise, and occur with low overhead. 

This document is intended for Terpsichore software and hardware developers
alike, and defines the interface at which their designs must meet. Terpsichore
pursues the most efficient tradeoffs between hardware and software complexity
by making all processor, memory, and interface resources directly accessible to
high-level language programs.

Conformance

To ensure that Terpsichore systems are able to freely interchange data, user-level
programs, system-level programs and interface devices, the Terpsichore system
architecture reaches above the processor level architecture.

Mandatory and Optional Areas

A computer system conforms to the requirements of the Terpsichore System
Architecture if and only if it implements all the specifications described in this
document and other specifications included by reference. Conformance to the
specification is mandatory in all areas, including the instruction set, memory
management system, interface devices and external interfaces, and bootstrap
ROM functional requirements, except where explicit options are stated.

Optional areas include: 

Number of processor threads 
Size of first-level cache memories 
Existence of a second-level cache 
Size of second-level cache memory 
Size of system-level memory 

Existence of certain optional interface device interfaces 

Conformance to the specification is also optional regarding the physical
implementation of internal interfaces, specifically that of the Cerberus serial bus
architecture, the Hermes high-bandwidth channel architecture, and the Icarus
interprocessor interconnection architecture. An implementation may replace,
modify or eliminate these interfaces, provided that the software-level functionality
is unchanged.

Upward-compatible Modifications

From time to time, MicroUnity may modify the architecture in an upward- 
compatible manner, such as by the addition of new instructions, definition of 
reserved bits in system state, or addition of new standard interfaces. Such 
modifications will be added as options, so that designs which conform to this 
version of the architecture will conform to future, modified versions. 

Additional devices and interfaces, not covered by this standard, may be added in
specified regions of the physical memory space, provided that system reset places
these devices and interfaces in an inactive state that does not interfere with the
operation of software that runs in any conformant system. The software interface
requirements of any such additional devices and interfaces must be made as
widely available as this architecture specification.

Promotion of Optional Features

It is most strongly recommended that such optional instructions, state or
interfaces be implemented in all conforming designs. Such implementations
enhance the value of the features in particular and the architecture as a whole by
broadening the set of implementations over which software may depend upon the
presence of these features.

Implementations which fail to implement these features may encounter
unacceptable levels of overhead when attempting to emulate the features by
exception handlers or use of virtual memory. This is a particular concern when
involved in code which has real-time performance constraints.

In order that upward-compatible optional extensions of the original Terpsichore
system architecture may be relied upon by system and application software,
MicroUnity may upon occasion promote optional features to mandatory
conformance for implementations designed or produced after a suitable delay
upon such notification by publication of future version of the specification.

Unrestricted Physical Implementation

Nothing in this specification should be construed to limit the implementation
choices of the conformant system beyond the specific requirements stated herein.
In particular, a computer system may conform to the Terpsichore System
Architecture while employing any number of components, dissipate any amount
of heat, require any special environmental facilities, or be of any physical size.

Draft Version

This document is a draft version of the architectural specification. In this form,
conformance to this document may not be claimed or implied. MicroUnity may
change this specification at any time, in any manner, until it has been declared
final. When this document has been declared final, the only changes will be to
correct bugs, defects or deficiencies, and to add upward-compatible optional
extensions.

Common Elements

Notation

The descriptive notation used in this document is summarized in the table below:

x + y       two's complement or floating-point addition of x and y. Result
            is the same size as the operands, and operands must be of
            equal size.
x - y       two's complement or floating-point subtraction of y from x.
            Result is the same size as the operands, and operands must be
            of equal size.
x * y       two's complement or floating-point multiplication of x and y.
            Result is the same size as the operands, and operands must be
            of equal size.
x / y       two's complement or floating-point division of x by y. Result is
            the same size as the operands, and operands must be of equal
            size.
x = y       two's complement or floating-point equality comparison
            between x and y. Result is a single bit, and operands must be
            of equal size.
x ≠ y       two's complement or floating-point inequality comparison
            between x and y. Result is a single bit, and operands must be
            of equal size.
x < y       two's complement or floating-point less than comparison
            between x and y. Result is a single bit, and operands must be
            of equal size.
x ≥ y       two's complement or floating-point greater than or equal
            comparison between x and y. Result is a single bit, and
            operands must be of equal size.
√x          floating-point square root of x
x || y      concatenation of bit field x to left of bit field y
x^y         binary digit x repeated, concatenated y times. Size of result is
            y.
x_y         extraction of bit y (using little-endian bit numbering) from
            value x. Result is a single bit.
x_y..z      extraction of bit field formed from bits y through z of value x
x ? y : z   value of y, if x is true, otherwise value of z. Value of x is a
            single bit.
x ← y       bitwise assignment of x to value of y
Sn          signed, two's complement, binary data format of n bytes
Un          unsigned binary data format of n bytes
Fn          floating-point data format of n bytes

descriptive notation

Bit ordering

The ordering of bits in this document is always little-endian, regardless of the 
ordering of bytes within larger data structures. Thus, the least-significant bit of a 
data structure is always labeled 0 (zero), and the most-significant bit is labeled as 
the data structure size (in bits) minus one. 
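The bit-level notation and the little-endian bit numbering can be illustrated with a few small helpers. This is only a sketch in Python, not part of the architecture definition; the function names (`bit`, `field`, `concat`, `repeat`) are hypothetical and were chosen for this example.

```python
# Illustrative helpers for the descriptive notation; all names are assumptions.

def bit(x: int, y: int) -> int:
    """x_y: extract bit y of x, where bit 0 is the least-significant bit."""
    return (x >> y) & 1

def field(x: int, y: int, z: int) -> int:
    """x_y..z: extract the bit field from bit y down through bit z (y >= z)."""
    return (x >> z) & ((1 << (y - z + 1)) - 1)

def concat(x: int, y: int, y_size: int) -> int:
    """x || y: place bit field x to the left of the y_size-bit field y."""
    return (x << y_size) | (y & ((1 << y_size) - 1))

def repeat(b: int, count: int) -> int:
    """x^y: binary digit b repeated count times; the result is count bits wide."""
    return ((1 << count) - 1) if (b & 1) else 0
```

For example, `field(0xABCD, 15, 8)` selects the upper byte of a 16-bit value, and `concat(0xA, 0x5, 4)` joins two 4-bit fields into one 8-bit field.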

Memory 

Terpsichore memory is an array of 2^64 bytes, without a specified byte ordering,
which is physically distributed among various Terpsichore components.

 7           0
| byte 0      |
| byte 1      |
| byte 2      |
|    ...      |
| byte 2^64-1 |
       8

Byte

A byte is a single element of the memory array, consisting of 8 bits:

 7    0
| byte |
   8

Byte ordering

Larger data structures are constructed from the concatenation of bytes in either
little-endian or big-endian byte ordering. A memory access of a data structure of
size s at address i is formed from memory bytes at addresses i through i+s-1.
Unless otherwise specified, there is no specific requirement of alignment: it is not 
generally required that i be a multiple of s. Aligned accesses are preferred 
whenever possible, however, as they will often require one less processor or 
memory clock cycle than unaligned accesses. 

With little-endian byte ordering, the bytes are arranged as:

 s*8-1    s*8-8         15       8 7      0
| byte i+s-1   |  ...  | byte i+1 | byte i |
       8                    8         8

With big-endian byte ordering, the bytes are arranged as:

 s*8-1  s*8-8 s*8-9  s*8-16         7          0
| byte i     | byte i+1    |  ...  | byte i+s-1 |
      8            8                     8

Terpsichore memory is byte-addressed, using either little-endian or big-endian
byte ordering. For consistency with the bit ordering, and for compatibility with
x86 processors, Terpsichore uses little-endian byte ordering when an ordering
must be selected. Euterpe load and store instructions are available for both
little-endian and big-endian byte ordering. The selection of byte ordering is dynamic, so
that little-endian and big-endian processes, and even data structures within a
process, can be intermixed on the processor.
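The two byte orderings can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Euterpe load/store implementation: memory is modeled as a plain `bytearray`, and the names `load` and `store` are hypothetical.

```python
# Sketch: an s-byte access at address i uses memory bytes i through i+s-1.
# The byte ordering is chosen per access, mirroring the dynamic selection in the text.

def load(memory: bytearray, i: int, s: int, big_endian: bool = False) -> int:
    """Assemble an unsigned value of s bytes starting at address i."""
    if i < 0 or i + s > len(memory):
        raise IndexError("access outside modeled memory")
    order = "big" if big_endian else "little"
    return int.from_bytes(memory[i:i + s], order)

def store(memory: bytearray, i: int, s: int, value: int,
          big_endian: bool = False) -> None:
    """Write the low s bytes of value to addresses i through i+s-1."""
    if i < 0 or i + s > len(memory):
        raise IndexError("access outside modeled memory")
    order = "big" if big_endian else "little"
    memory[i:i + s] = (value & ((1 << (8 * s)) - 1)).to_bytes(s, order)
```

Note that the address in this sketch need not be a multiple of s, consistent with the statement that alignment is preferred but not generally required.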

Memory read/load semantics

Terpsichore memory, including memory-mapped registers, must conform to the
following requirements regarding side-effects of read or load operations:

A memory read must have no side-effects on the contents of the addressed 
memory nor on the contents of any other memory. 

Memory write/store semantics

Terpsichore memory, including memory-mapped registers, must conform to the
following requirements regarding side-effects of write or store operations:

A memory write must have no side-effects on the contents of the addressed
memory. A memory write may cause side-effects on the contents of memory not
addressed by the write operation; however, a second memory write of the same
value to the same address must have no side-effects on any memory: memory
write operations must be idempotent.

Euterpe store instructions which are weakly ordered may have side-effects on the 
contents of memory not addressed by the store itself; subsequent load instructions 
which are also weakly ordered may or may not return values which reflect the 
side-effects. 



Euterpe provides eight-byte (64-bit) virtual and physical address sizes, and eight- 
byte (64-bit) and sixteen-byte (128-bit) data path sizes, and uses fixed-length four- 
byte (32-bit) instructions. Arithmetic is performed on two's-complement or 
unsigned binary and ANSI/IEEE standard 754-1985 conforming binary floating- 
point number representations. 



Data

Fixed-point Data

Bit

A bit is a primitive data element: 

0 
1 

Peck 

A peck is the catenation of two bits: 

1 0 
2 

Nibble 

A nibble is the catenation of four bits:

 3 0
  4

Byte

A byte is the catenation of eight bits, and is a single element of the memory array:

 7    0
| byte |
   8

Doublet

A doublet is the catenation of 16 bits, and is the concatenation of two bytes:

 15      0
| doublet |
    16

Quadlet

A quadlet is the catenation of 32 bits, and is the concatenation of four bytes:

 31      0
| quadlet |
    32

Octlet

An octlet is the catenation of 64 bits, and is the concatenation of eight bytes:

 63           32
| octlet63..32  |
       32
 31            0
| octlet31..0   |
       32

Hexlet

A hexlet is the catenation of 128 bits, and is the concatenation of sixteen bytes:

 127           96
| hexlet127..96  |
       32
 95            64
| hexlet95..64   |
       32
 63            32
| hexlet63..32   |
       32
 31             0
| hexlet31..0    |
       32

Address 

Terpsichore addresses are octlet quantities.

Floating-point Data

Terpsichore's floating-point formats are designed to satisfy ANSI/IEEE standard 
754-1985: Binary Floating-point Arithmetic. Standard 754 leaves certain aspects to 
the discretion of the implementor: 

Terpsichore adds additional half-precision and quad-precision formats to
standard 754's single-precision and double-precision formats. Terpsichore's
double-precision satisfies standard 754's precision requirements for a single-
extended format, and Terpsichore's quad-precision satisfies standard 754's
precision requirements for a double-extended format.

Quiet NaN values are denoted by any sign bit value, an exponent field of all one
bits, and a non-zero significand with the most significant bit cleared. Quiet NaN
values generated by default exception handling of standard operations have a zero
sign bit, an exponent field of all one bits, and a significand field with the most
significant bit cleared, the next-most significant bit set, and all other bits cleared.

Signaling NaN values are denoted by any sign bit value, an exponent field of all
one bits, and a non-zero significand with the most significant bit set.
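The NaN encoding rules above can be expressed as a short classification routine. The following Python sketch is illustrative only; the function names and the parameterization by field widths are assumptions made for this example. It follows the convention stated in the text (most significant significand bit cleared for quiet, set for signaling), which is the reverse of the convention used by many later processors.

```python
# Sketch of the NaN encoding described in the text. Field widths are passed in
# so the same code covers half (5, 10), single (8, 23), double (11, 52), etc.

def classify_nan(bits: int, exponent_bits: int, significand_bits: int) -> str:
    """Return 'quiet', 'signaling', or 'not a NaN' for a raw bit pattern."""
    exponent = (bits >> significand_bits) & ((1 << exponent_bits) - 1)
    significand = bits & ((1 << significand_bits) - 1)
    all_ones = (1 << exponent_bits) - 1
    if exponent != all_ones or significand == 0:
        return "not a NaN"          # finite value, or infinity when significand == 0
    msb = significand >> (significand_bits - 1)
    return "signaling" if msb else "quiet"

def default_quiet_nan(exponent_bits: int, significand_bits: int) -> int:
    """Quiet NaN from default exception handling: sign 0, exponent all ones,
    significand MSB cleared, next-most significant bit set, all others cleared."""
    exponent = (1 << exponent_bits) - 1
    significand = 1 << (significand_bits - 2)
    return (exponent << significand_bits) | significand
```

With half-precision field widths (5 exponent bits, 10 significand bits), `default_quiet_nan(5, 10)` yields the pattern 0x7D00.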

Half-precision Floating-point

Terpsichore half precision uses a format similar to standard 754's requirements,
reduced to a 16-bit overall format. The format contains sufficient precision and
exponent range to hold a 12-bit signed integer.

15 14 10 9 0 

|sign|exponent| significand | 

1 5 10 

Single-precision Floating-point

Terpsichore single precision satisfies standard 754's requirements for "single."

 31   30     23 22          0
|sign| exponent | significand |
  1       8          23

Double-precision Floating-point

Terpsichore double precision satisfies standard 754's requirements for "double."

 63   62     52 51                32
|sign| exponent | significand51..32 |
  1      11            20
 31                                0
| significand31..0                  |
                32

Case 2:05-cv-00505-TJW Document 149 Filed 10/15/2007 Page 19 of 40 

wo 97/07450 PCT/US96a3047 

Quad-precision Floating-point

Terpsichore quad precision satisfies standard 754's requirements for "double
extended," but has additional significand precision to use 128 bits.

 127  126    112 111               96
|sign| exponent | significand111..96 |
  1      15             16
 95                                64
| significand95..64                  |
                32
 63                                32
| significand63..32                  |
                32
 31                                 0
| significand31..0                   |
                32






Euterpe Processor



MicroUnity's Euterpe processor provides the general-purpose, high-bandwidth
computation capability of the Terpsichore system. Euterpe includes high-
bandwidth data paths, register files, and a memory hierarchy. Euterpe's memory
hierarchy includes on-chip instruction and data memories, instruction and data
caches, a virtual memory facility, and interfaces to external devices. Euterpe's
interfaces include flash ROM, synchronous DRAM, Cerberus serial bus, Hermes
high-bandwidth channels, and simple keyboard and display.

Architectural Framework

The Euterpe architecture builds upon MicroUnity's Hermes high-bandwidth
channel architecture and upon MicroUnity's Cerberus serial bus architecture,
and complies with the requirements of Hermes and Cerberus. Euterpe uses
parameters A and W as defined by Hermes.

The Euterpe architecture defines a compatible framework for a family of
implementations with a range of capabilities. The following implementation-
defined parameters are used in the rest of the document in boldface. The value
indicated is for MicroUnity's first Euterpe implementation.



Parameter  Interpretation                          Value  Range of legal values

T          number of execution threads             5      1 ≤ T ≤ 31
I          support for Icarus                      0      0 ≤ I ≤ 1
i          log2 Hermes words per interleave block  1      0 ≤ i ≤ 1
H          number of Hermes channels               2      0 ≤ H ≤ 15
Ce         instruction SRAM can be all cache       0      0 ≤ Ce ≤ 1
Cb         instruction SRAM can be all buffer      1      0 ≤ Cb ≤ 1
c          log2 cache blocks in instruction

Interfaces and Block Diagram
Instruction 

Instruction Mnemonics 

Instruction mnemonics are usually written with periods (.) separating elements of 
the mnemonic to make them easier to understand. Terpsichore assemblers and 
other code tools treat these periods as optional; the mnemonics are designed to be 
parsed uniquely either with or without the periods. 

Instruction Structure

A Terpsichore instruction is specifically defined as a four-byte structure with the
little-endian ordering shown below. It is different from the quadlet defined above
because the placement of instructions into memory must be independent of the
byte ordering used for data structures. Instructions must be aligned on four-byte
boundaries: in the diagram below, i must be a multiple of 4.

 31      24 23      16 15       8 7      0
| byte i+3 | byte i+2 | byte i+1 | byte i |
     8          8          8         8
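The fixed instruction layout can be sketched as a fetch routine that always assembles bytes in little-endian order and enforces four-byte alignment, regardless of the byte ordering selected for data. This Python fragment is an illustration only; `fetch_instruction` is a hypothetical name, and raising `ValueError` on misalignment is an assumption of the sketch rather than the architecture's exception behavior.

```python
# Sketch: instruction words are always little-endian and four-byte aligned.

def fetch_instruction(memory: bytes, i: int) -> int:
    """Return the 32-bit instruction word stored at address i."""
    if i % 4 != 0:
        # Assumption: a real implementation would raise an architectural exception.
        raise ValueError("instruction address must be a multiple of 4")
    if i < 0 or i + 4 > len(memory):
        raise IndexError("fetch outside modeled memory")
    # byte i supplies bits 7..0, byte i+3 supplies bits 31..24
    return int.from_bytes(memory[i:i + 4], "little")
```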

Gateway 

A Terpsichore gateway is specifically defined as a 16-byte structure with the
little-endian ordering shown below. A gateway contains a code address and a data
address used to invoke a procedure at a higher privilege level securely. Gateways
are marked by protection information specified in the TLB. Gateways must be
aligned on 16-byte boundaries, that is, in the diagram below, i must be a multiple
of 16.

The gateway contains two data items within its structure, a code address and
data address:

 127           64
| data address   |
       64
 63             0
| code address   |
       64

User state 

The user state consists of hardware data structures that are accessible to all 
conventional compiled code. The Terpsichore user state is designed to be as 
regular as possible, and consists only of the general registers, the program 
counter, and virtual memory. There are no specialized registers for condition 
codes, operating modes, rounding modes, integer multiply/divide, or floating-point
values.

General Registers

Terpsichore user state includes 64 general registers. All are identical; there is no
dedicated zero-valued register, and there are no dedicated floating-point registers.

 63        0
| REG[0]    |
| REG[1]    |
| REG[2]    |
|   ...     |
| REG[62]   |
| REG[63]   |
     64



Definition

def val ← RegRead(rn, size)
    case size of
        64:
            val ← REG[rn]
        128:
            if rn_0 then
                raise ReservedInstruction
            endif
            val ← REG[rn+1] || REG[rn]
    endcase
enddef

def RegWrite(rn, size, val)
    case size of
        64:
            REG[rn] ← val_63..0
        128:
            if rn_0 then
                raise ReservedInstruction
            endif
            REG[rn+1] ← val_127..64
            REG[rn] ← val_63..0
    endcase
enddef
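The register access definitions above translate directly into a small executable model. The Python below is a sketch for illustration, not MicroUnity's implementation: the names `reg_read`, `reg_write`, and `MASK64` are hypothetical, and the register file is modeled as a list of 64 integers. A 128-bit access names an even register for the low octlet and uses the following register for the high octlet.

```python
# Executable model of RegRead/RegWrite; names other than REG and
# ReservedInstruction are assumptions made for this sketch.

class ReservedInstruction(Exception):
    """Raised when a 128-bit access names an odd-numbered register."""

MASK64 = (1 << 64) - 1
REG = [0] * 64                      # 64 general registers, 64 bits each

def reg_read(rn: int, size: int) -> int:
    if size == 64:
        return REG[rn]
    if size == 128:
        if rn & 1:                  # bit 0 of rn set: register pair is misaligned
            raise ReservedInstruction(rn)
        return (REG[rn + 1] << 64) | REG[rn]
    raise ValueError("size must be 64 or 128")

def reg_write(rn: int, size: int, val: int) -> None:
    if size == 64:
        REG[rn] = val & MASK64
    elif size == 128:
        if rn & 1:
            raise ReservedInstruction(rn)
        REG[rn + 1] = (val >> 64) & MASK64   # bits 127..64
        REG[rn] = val & MASK64               # bits 63..0
    else:
        raise ValueError("size must be 64 or 128")
```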

Program Counter

The program counter contains the address of the currently executing instruction.
This register is implicitly manipulated by branch instructions, and read by branch
instructions that save a return address in a general register.

 63                   2 1  0
| ProgramCounter       |    |
          62             2

Privilege Level

The privilege level register contains the privilege level of the currently executing
instruction. This register is implicitly manipulated by branch gateway and branch
down instructions, and read by branch gateway instructions that save a return
address in a general register.

 1  0
| PL |
  2

Program Counter and Privilege Level

The program counter and privilege level may be packed into a single octlet. This
combined data structure is saved by the Branch Gateway instruction and restored
by the Branch Down instruction.

 63                   2 1  0
| ProgramCounter       | PL |
          62             2
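Packing the program counter and privilege level into one octlet can be sketched as below. The layout is an assumption read from the bit positions in the diagrams (program counter in bits 63..2, privilege level in bits 1..0), which works because instructions are four-byte aligned. The function names are hypothetical.

```python
# Sketch: pack a four-byte-aligned program counter and a 2-bit privilege level
# into a single 64-bit value, and unpack it again.

MASK64 = (1 << 64) - 1

def pack_pc_pl(program_counter: int, privilege_level: int) -> int:
    if program_counter % 4 != 0:
        raise ValueError("program counter must be a multiple of 4")
    if not 0 <= privilege_level <= 3:
        raise ValueError("privilege level must fit in two bits")
    return (program_counter & MASK64) | privilege_level

def unpack_pc_pl(octlet: int) -> tuple[int, int]:
    """Return (program_counter, privilege_level)."""
    return octlet & (MASK64 ^ 0b11), octlet & 0b11
```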



System state 



The system state consists of the facilities not normally used by conventional
compiled code. These facilities provide mechanisms to execute such code in a fully
virtual environment. All system state is memory mapped, so that it can be
manipulated by compiled code.

Fixed-point

Terpsichore provides load and store instructions to move data between memory
and the registers, branch instructions to compare the contents of registers and to
transfer control from one code address to another, and arithmetic operations to
perform computation on the contents of registers, returning the result to registers.

Load and Store

The load and store instructions move data between memory and the registers. 
When loading data from memory into a register, values are zero-extended or sign- 
extended to fill the register. When storing data from a register into memory, 
values are truncated on the left to fit the specified memory region. 
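The extension and truncation rules can be illustrated with a short Python sketch. It models a 64-bit register as an unsigned integer; the function names `load_extend` and `store_truncate` are hypothetical and do not correspond to Euterpe instruction mnemonics.

```python
# Sketch of load extension and store truncation for a 64-bit register.

MASK64 = (1 << 64) - 1

def load_extend(value: int, size_bytes: int, signed: bool) -> int:
    """Zero-extend or sign-extend a size_bytes-wide memory value to 64 bits."""
    bits = 8 * size_bytes
    value &= (1 << bits) - 1
    if signed and (value >> (bits - 1)) & 1:
        value -= 1 << bits           # interpret as negative two's complement
    return value & MASK64            # register image as an unsigned 64-bit pattern

def store_truncate(register: int, size_bytes: int) -> int:
    """Drop the high-order (left) bits so the value fits the memory region."""
    return register & ((1 << (8 * size_bytes)) - 1)
```

For instance, loading the byte 0x80 as signed fills the upper 56 bits of the register with ones, while loading it as unsigned leaves them zero.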

Load and store instructions that specify a memory region of more than one byte
may use either little-endian or big-endian byte ordering: the size and ordering are
explicitly specified in the instruction. Regions larger than one byte may be either
aligned to addresses that are an even multiple of the size of the region, or of
unspecified alignment: alignment checking is also explicitly specified in the
instruction.

The load and store instructions are used for fixed-point data as well as floating- 
point and digital signal processing data; Terpsichore has a single bank of registers 
for all data types. 

Swap instructions provide multithread and multiprocessor synchronization, using
indivisible operations: add-and-swap, compare-and-swap, and multiplex-and-
swap. A store-multiplex operation provides the ability to indivisibly write to a
portion of an octlet. These instructions always operate on aligned octlet data, using
either little-endian or big-endian byte ordering.

Branch Conditionally

The fixed-point compare-and-branch instructions provide all arithmetic tests for
equality and inequality of signed and unsigned fixed-point values. Tests are
performed either between two operands contained in general registers, or on the
bitwise and of two operands. Depending on the result of the compare, either a
branch is taken, or not taken. A taken branch causes an immediate transfer of the
program counter to the target of the branch, specified by a 12-bit signed offset
from the location of the branch instruction. A non-taken branch causes no
transfer; execution continues with the following instruction.
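The taken and non-taken cases can be sketched as a next-address computation. The text specifies only a 12-bit signed offset from the branch location; this sketch additionally assumes that the offset counts four-byte instructions, which is not stated in the passage above. The names `sign_extend` and `next_pc` are hypothetical.

```python
# Sketch of compare-and-branch target calculation.
# ASSUMPTION: the 12-bit offset is scaled by the 4-byte instruction size.

MASK64 = (1 << 64) - 1

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if (value >> (bits - 1)) & 1 else value

def next_pc(pc: int, offset_field: int, taken: bool) -> int:
    """Return the address of the next instruction to execute."""
    if not taken:
        return (pc + 4) & MASK64     # fall through to the following instruction
    return (pc + 4 * sign_extend(offset_field, 12)) & MASK64
```

Under this assumption a 12-bit offset reaches 2048 instructions backward and 2047 forward of the branch, which is why more distant targets need the unconditional branch forms.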

Branch Unconditionally 

Other branch instructions provide for unconditional transfer of control to
addresses too distant to be reached by a 12-bit offset, and to transfer to a target
while placing the location following the branch into a register. The branch through
gateway instruction provides a secure means to access code at a higher privilege
level, in a form similar to a normal procedure call.

Arithmetic Operations

The fixed-point arithmetic operations include add, subtract, multiply, divide,
shifts, and set on compare, all using octlet-sized operands. Multiply and divide
operations produce hexlet results; all other operations produce octlet results.

When specified, add, subtract, and shift operations may cause a fixed-point
arithmetic exception to occur on resulting conditions such as signed overflow or
signed or unsigned equality or inequality to zero.

Addressing Operations

A subset of the arithmetic operations are available as addressing operations.
These addressing operations may be performed at a point in the Euterpe
processor pipeline so that they may be completed prior to or in conjunction with
the execution of load and store operations in a "superspring" pipeline in which
other arithmetic operations are deferred until the completion of load and store
operations.

Floating-point

Terpsichore provides all the facilities mandated and recommended by
ANSI/IEEE standard 754-1985: Binary Floating-point Arithmetic, with the use of
supporting software.

Branch Conditionally

The floating-point compare-and-branch instructions provide all the comparison
types required and suggested by the IEEE floating-point standard. These floating-
point comparisons augment the usual types of numeric value comparisons with
special handling for NaN (not-a-number) values. A NaN compares as
"unordered" with respect to any other value, even that of an identical NaN.
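The unordered behavior of NaN values can be observed in any IEEE 754 environment. The Python sketch below is illustrative and uses Python's own float comparisons rather than Euterpe instructions; `is_unordered` is a hypothetical helper name. It relies on the property that a value compares unequal to itself only when it is a NaN.

```python
# Sketch: detecting the unordered condition by comparing an operand with itself.

def is_unordered(x: float) -> bool:
    """True when x is a NaN, since a NaN is unordered even against itself."""
    return x != x

nan = float("nan")
# Every ordered relation involving a NaN is false; only "not equal" is true.
relations = (nan == nan, nan < 1.0, nan > 1.0, nan <= nan, nan >= nan)
```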

Terpsichore floating-point compare-and-branch instructions do not generate an 
exception on comparisons involving quiet or signaling NaNs; if such exceptions 
are desired, they can be obtained by combining the use of a floating-point 
compare and set instruction, with either a floating-point compare-and-branch 
instruction on the floating-point operands or a fixed-point compare-and-branch on 
the set result. 

Because the less and greater relations are anti-commutative, one of each relation 
that differs from another only by the replacement of an L with a G in the code can 
be removed by reversing the order of the operands and using the other code. 
Thus, a UL relation can be used in place of a UG relation by swapping the 
operands to the compare-and-branch or compare-and-set instruction. 

The E and NE relations can be used to determine the unordered condition of a 
single operand by comparing the operand with itself. 
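
The two rules above (swapping operands to trade a "less" relation for a "greater" one, and comparing an operand with itself to detect a NaN) can be sketched in Python, whose floats follow IEEE 754 comparison behavior. The helper names are hypothetical and are not Terpsichore mnemonics.

```python
def is_unordered(x: float) -> bool:
    """A value is unordered with respect to itself only if it is a NaN."""
    return x != x  # the NE relation applied to a single operand


def greater(a: float, b: float) -> bool:
    """Obtain a 'greater' relation from 'less' by swapping the operands."""
    return b < a


nan = float("nan")
print(is_unordered(nan), is_unordered(1.5))  # True False
print(greater(3.0, 2.0), greater(nan, 2.0))  # True False
```

Any ordered comparison involving a NaN is false, which is why the swapped form still behaves correctly when one operand is a NaN.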

The following floating-point compare-and-branch relations are provided:

compare-and-branch relations



Compare-and-Set

The floating-point compare-and-set instructions provide all the comparison types
supported as compare-and-branch instructions. Terpsichore floating-point
compare-and-set instructions may optionally generate an exception on
comparisons involving quiet or signaling NaNs.



The following floating-point compare-and-branch relations are provided:

Arithmetic Operations

The operations supported in hardware are floating-point add, subtract, multiply,
divide, and square root. Other operations required by the ANSI/IEEE floating-
point standard are provided by software libraries.

The operations explicitly specify the precision of the operation, and round the
result to the specified precision at the conclusion of each operation.

A single instruction provides a floating-point multiply with the result fed into a
floating-point add. The result is computed as if the multiply is performed to
infinite precision, added as if in infinite precision, then rounded. This operation is
a particularly good match to the needs of vector linear algebra routines.
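
The effect of rounding once rather than twice can be seen in a small Python sketch. It models the fused operation with exact rational arithmetic followed by a single conversion to double precision; this is an illustration of the arithmetic only, not of the hardware, and the function names are hypothetical.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to double precision."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no intermediate rounding
    return float(exact)  # single rounding at the end


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Round after the multiply and again after the add."""
    return a * b + c


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0
fused = fused_multiply_add(a, b, c)        # exact product is 1 - 2**-60
separate = separate_multiply_add(a, b, c)  # product rounds to 1.0 first
print(fused, separate)
```

With these operands the separately rounded form loses the entire result, while the fused form returns the exact value of -2**-60.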

Rounding 

Rounding is specified within the instructions explicitly, to avoid maintaining 
explicit state for a rounding mode. 

Exceptions 

All the mandated floating-point exception conditions cause a trap when they 
occur; maintenance of sticky and other status bits may be performed using 
software routines. Because the floating-point inexact exception may be very 
frequent, this exception only occurs when specified in the instruction explicitly.
Arithmetic operations may also specify that all exceptions are to be handled by 
default, generating special results instead of traps. 

Digital Signal Processing 

The Terpsichore processor provides a set of operations that maintain the fullest 
possible use of 64- and 128-bit data paths when operating on lower-precision 
fixed-point or floating-point vector values. These operations are useful for several 
application areas, including digital signal processing, image processing, and 
synthetic graphics. The basic goal of these operations is to accelerate the 
performance of algorithms that exhibit the following characteristics:

Low-precision arithmetic

The operands and intermediate results are fixed-point values represented in no 
greater than 64 bit precision. For floating-point arithmetic, operands and 
intermediate results are of 16, 32, or 64 bit precision. 

The use of fixed-point arithmetic permits various forms of operation reordering
that are not permitted in floating-point arithmetic. Specifically, commutativity,
associativity, and distribution identities can be used to reorder operations.
Compilers can evaluate operations to determine what intermediate precision is
required to get the specified arithmetic result.

Terpsichore supports several levels of precision, as well as operations to convert
between these different levels. These precision levels are always powers of two
and are explicitly specified in the operation code.
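
The reordering argument made above for fixed-point arithmetic can be checked with a short Python sketch. It assumes octlet-sized fixed-point addition wraps modulo 2**64 (an assumption made for illustration) and contrasts it with double-precision floating-point addition.

```python
MASK64 = (1 << 64) - 1  # assumed: octlet arithmetic wraps modulo 2**64


def add64(x: int, y: int) -> int:
    """Wrapping 64-bit fixed-point addition."""
    return (x + y) & MASK64


x, y, z = 0xFFFF_FFFF_FFFF_FFF0, 0x20, 0x1234

# Fixed-point: regrouping the additions gives the same bits, even on overflow.
fixed_left = add64(add64(x, y), z)
fixed_right = add64(x, add64(y, z))

# Floating-point: regrouping changes where rounding happens.
float_left = (0.1 + 0.2) + 0.3
float_right = 0.1 + (0.2 + 0.3)

print(fixed_left == fixed_right)   # True
print(float_left == float_right)   # False
```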

Sequential access to data

The algorithms are or can be expressed as operations on sequentially ordered
items in memory. Scatter-gather memory access or sparse-matrix techniques are
not required.

Where an index variable is used with a multiplier, such multipliers must be
powers of two. When the index is of the form nx+k, the value of n must be a
power of two, and the values referenced should have k include the majority of
values in the range 0..n-1. A negative multiplier may also be used.

Vectorizable operations

The operations performed on these sequentially ordered items are identical and 
independent. Conditional operations are either rewritten to use boolean variables 
or masking, or the compiler is permitted to convert the code into such a form. 

Data-handling operations

The characteristics of these algorithms include sequential access to data, which
permit the use of the normal load and store operations to reference the data.
Octlet and hexlet loads and stores reference several sequential items of data, the
number depending on the operand precision.

The discussion of these operations is independent of byte ordering, though the 
ordering of bit fields within octlets and hexlets must be consistent with the 
ordering used for bytes. Specifically, if big-endian byte ordering is used for the 
loads and stores, the figures below should assume that index values increase from 
left to right, and for little-endian byte ordering, the index values increase from
right to left. For this reason, the figures indicate different index values with 
different shades, rather than numbering. 

When an index of the nx+k form is used in array operands, where n is a power of
2, data memory sequentially loaded contains elements useful for separate
operands. The "deal" instruction divides a hexlet of data up into two octlets, with
alternate bit fields of the source hexlet grouped together into the two results. For
example, a G.DEAL.16 operation rearranges the source hexlet into two octlets as
follows:¹

16-bit 2-way deal

¹An example of the use of a deal can be found in the appendix: Digital Signal Processing
Applications: Decimation of Monochrome Image or Decimation of Color Image
In the deal operation, the source hexlet is specified by two octlet registers, and the
two result octlets are specified as a hexlet register pair. (This sounds backwards,
and it really is, but it works in practice, because the result is usually used in
operations that accept octlet operands. Ideally, the source hexlet should be a
register pair, and the result should be two octlet registers.)
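
A minimal Python sketch of a 16-bit deal follows. It models the hexlet as a 128-bit integer and assumes field index 0 sits in the least significant bits (a little-endian view chosen only for illustration); the helper names unpack, pack and g_deal are hypothetical.

```python
def unpack(value: int, width: int, count: int) -> list:
    """Split an integer into `count` bit fields of `width` bits, index 0 lowest."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(count)]


def pack(fields: list, width: int) -> int:
    """Inverse of unpack: place field i at bit offset i * width."""
    mask = (1 << width) - 1
    value = 0
    for i, field in enumerate(fields):
        value |= (field & mask) << (i * width)
    return value


def g_deal(hexlet: int, width: int) -> tuple:
    """Deal alternate fields of a 128-bit hexlet into two 64-bit octlets."""
    fields = unpack(hexlet, width, 128 // width)
    return pack(fields[0::2], width), pack(fields[1::2], width)


src = pack([0xA0, 0xB0, 0xA1, 0xB1, 0xA2, 0xB2, 0xA3, 0xB3], 16)
even_octlet, odd_octlet = g_deal(src, 16)
print([hex(f) for f in unpack(even_octlet, 16, 4)])  # ['0xa0', '0xa1', '0xa2', '0xa3']
print([hex(f) for f in unpack(odd_octlet, 16, 4)])   # ['0xb0', '0xb1', '0xb2', '0xb3']
```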



The example above directly applies to the case where n is 2. When n is larger, a 
series of DEAL operations can be used to further subdivide the sequential stream. 
For example, when n is 4, we need to deal out 4 sets of doublet operands, as shown 
in the figure below:²




16-bit 4-way deal



This 4-way deal is performed by dealing out 2 sets of quadlet operands, and then 
dealing each of them out into 2 sets of doublet operands. 



²An example of the use of a four-way deal can be found in the appendix: Digital Signal
Processing Applications: Conversion of Color to Monochrome

16-bit 4-way deal




There are three rows of arrows shown above. The first row is the result of two 
G.DEAL.32 operations, each independently dealing 2 sets of pairs of doublets. 
The result of these two operations is the second row of boxes. The last row is the 
result of two independent G.DEAL.16 operations, each dealing 2 sets of doublets 
into register pairs. The middle row of arrows shows the implicit action performed 
by specifying two non-adjacent registers for the hexlet sources of the G.DEAL.16 
operations. 
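
The two-stage 4-way deal described above can be traced with a short Python sketch. Fields are kept as a list in index order so the example does not depend on byte ordering; the deal helper and its group parameter are hypothetical stand-ins for G.DEAL.32 (group of two doublets) and G.DEAL.16 (single doublets).

```python
def deal(fields: list, group: int = 1) -> tuple:
    """Deal a list of fields into two halves, alternating groups of `group` fields."""
    chunks = [fields[i:i + group] for i in range(0, len(fields), group)]
    first = [f for chunk in chunks[0::2] for f in chunk]
    second = [f for chunk in chunks[1::2] for f in chunk]
    return first, second


stream = list(range(16))             # 16 sequential doublets = two hexlets
hex0, hex1 = stream[:8], stream[8:]

# Stage 1: two G.DEAL.32-style steps, each dealing pairs of doublets.
a0, b0 = deal(hex0, group=2)
a1, b1 = deal(hex1, group=2)

# Stage 2: two G.DEAL.16-style steps; each source hexlet is formed from
# two non-adjacent octlets of stage 1.
k0, k1 = deal(a0 + a1)
k2, k3 = deal(b0 + b1)

print(k0, k1, k2, k3)
# [0, 4, 8, 12] [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15]
```

Each result holds the elements whose index is congruent to k modulo 4, which is the grouping needed for an nx+k access pattern with n equal to 4.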

When an array result of computation is accessed with an index of the form nx+k,
for n a power of 2, the reverse of the "deal" operation needs to be performed on
vectors of results to interleave them for storage in sequential order. The "shuffle"
operation interleaves the bit fields of two octlets of results into a single hexlet. For
example, a G.SHUFFLE.16 operation combines two octlets of doublet fields into a
hexlet as follows:
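
A minimal Python sketch of a 16-bit shuffle is given here. As an illustrative assumption, octlets and hexlets are modeled as integers with field index 0 in the least significant bits; the helper names are hypothetical.

```python
def unpack(value: int, width: int, count: int) -> list:
    """Split an integer into `count` bit fields of `width` bits, index 0 lowest."""
    mask = (1 << width) - 1
    return [(value >> (i * width)) & mask for i in range(count)]


def pack(fields: list, width: int) -> int:
    """Inverse of unpack: place field i at bit offset i * width."""
    mask = (1 << width) - 1
    value = 0
    for i, field in enumerate(fields):
        value |= (field & mask) << (i * width)
    return value


def g_shuffle(octlet_a: int, octlet_b: int, width: int) -> int:
    """Interleave the fields of two 64-bit octlets into one 128-bit hexlet."""
    a = unpack(octlet_a, width, 64 // width)
    b = unpack(octlet_b, width, 64 // width)
    interleaved = [field for pair in zip(a, b) for field in pair]
    return pack(interleaved, width)


hexlet = g_shuffle(pack([1, 2, 3, 4], 16), pack([5, 6, 7, 8], 16), 16)
print(unpack(hexlet, 16, 8))  # [1, 5, 2, 6, 3, 7, 4, 8]
```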

For larger values of n, a series of shuffle operations can be used to combine
additional sets of fields, similarly to the mechanism used for the deal operations.
For example, when n is 4, we need to shuffle up 4 sets of doublet operands, as
shown in the figure below:³

16-bit 4-way shuffle



This 4-way shuffle is performed by shuffling up 2 sets of doublet operands, and 
then shuffling each of them up as 2 sets of quadlet operands.



³An example of the use of a four-way shuffle can be found in the appendix: Digital Signal
Processing Applications: Conversion of Monochrome to Color

16-bit 4-way shuffle

There are three rows of arrows shown above. The first row is the result of two
G.SHUFFLE.16 operations, each independently shuffling 2 sets of pairs of
doublets. The result of these two operations is the second row of boxes. The last
row is the result of two independent G.SHUFFLE.32 operations, each shuffling 2
sets of quadlets into register pairs. The middle row of arrows shows the implicit
action performed by specifying two non-adjacent registers for the two octlet
sources of the G.SHUFFLE.32 operations.

When the index of a source array operand or a destination array result is negated,
or in other words, is of the form nx+k where n is negative, the elements of the
array must be arranged in reverse order. The "swap" operation reverses the order
of the bit fields in a hexlet. For example, a G.SWAP.16 operation reverses the
doublets within a hexlet:
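
The field reversal can be sketched in Python as follows, modeling the hexlet as a 128-bit integer. The function name g_swap is a hypothetical stand-in for the instruction.

```python
def g_swap(hexlet: int, width: int) -> int:
    """Reverse the order of the `width`-bit fields within a 128-bit hexlet."""
    count = 128 // width
    mask = (1 << width) - 1
    result = 0
    for i in range(count):
        field = (hexlet >> (i * width)) & mask
        result |= field << ((count - 1 - i) * width)  # mirror the field position
    return result


src = 0x0007_0006_0005_0004_0003_0002_0001_0000
print(hex(g_swap(src, 16)))  # 0x1000200030004000500060007
```

Applying the swap twice returns the original value, since reversal is its own inverse.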

Variations of the deal and shuffle operations are also useful for converting from
one precision to another. This may be required if one operand is represented in a 
different precision than another operand or the result, or if computation must be 
performed with intermediate precision greater than that of the operands, such as 
when using an integer multiply. 



When converting from a higher precision to a lower precision, specifically when 
halving the precision of a hexlet of bit fields, half of the data must be discarded, 
and the bit fields packed together. The "compress" operation is a variant of the
"deal" operation, in which the operand is a hexlet, and the result is an octlet. An
arbitrary half-sized sub-field of each bit field can be selected to appear in the 
result. For example, a selection of bits 19..4 of each quadlet in a hexlet is 
performed by the G.COMPRESS.16,4 operation: 
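
The bit selection performed by G.COMPRESS.16,4 can be sketched in Python. The sketch assumes plain bit-field extraction with no saturation or rounding, which the text above does not specify; the function name and parameter names are hypothetical.

```python
def g_compress(hexlet: int, result_width: int, shift: int) -> int:
    """Select `result_width` bits, starting at bit `shift`, from each
    double-width field of a 128-bit hexlet and pack them into an octlet."""
    source_width = 2 * result_width
    count = 128 // source_width
    source_mask = (1 << source_width) - 1
    result_mask = (1 << result_width) - 1
    result = 0
    for i in range(count):
        field = (hexlet >> (i * source_width)) & source_mask
        result |= ((field >> shift) & result_mask) << (i * result_width)
    return result


quadlets = [0x00012340, 0x000ABCD0, 0xFFF00015, 0x12345678]
hexlet = sum(q << (32 * i) for i, q in enumerate(quadlets))
print(hex(g_compress(hexlet, 16, 4)))  # 0x45670001abcd1234
```

Bits 19..4 of each quadlet survive; the upper 12 bits and lower 4 bits are discarded.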





Compress 32 bits to 16, with 4-bit right shift



When converting from lower-precision to higher-precision, specifically when 
doubling the precision of an octlet of bit fields, one of several techniques can be 
used, either multiply, expand, or shuffle. Each has certain useful properties. In 
the discussion below, m is the precision of the source operand. 



The multiply operation, described in detail below, automatically doubles the 
precision of the result, so multiplication by a constant vector will simultaneously 
double the precision of the operand and multiply by a constant that can be 
represented in m bits. 

An operand can be doubled in precision and shifted left with the "expand"
operation, which is essentially the reverse of the "compress" operation. For
example, the G.EXPAND.16,4 expands from 16 bits to 32, and shifts 4 bits left:
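
A Python sketch of the expand step follows. It assumes zero extension of each source field (the text does not say whether a signed variant exists), and the function name is hypothetical.

```python
def g_expand(octlet: int, source_width: int, shift: int) -> int:
    """Widen each `source_width`-bit field of a 64-bit octlet to twice its
    size, shifting it left by `shift` bits, producing a 128-bit hexlet."""
    result_width = 2 * source_width
    count = 64 // source_width
    source_mask = (1 << source_width) - 1
    result_mask = (1 << result_width) - 1
    result = 0
    for i in range(count):
        field = (octlet >> (i * source_width)) & source_mask  # zero-extended
        result |= ((field << shift) & result_mask) << (i * result_width)
    return result


expanded = g_expand(0x4567_0001_ABCD_1234, 16, 4)
print(hex(expanded))  # 0x4567000000010000abcd000012340
```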



Expand 16 bits to 32, with 4-bit left shift




The "shuffle" operation can double the precision of an operand and multiply it by
1 (unsigned only), 2^m, or 2^m+1, by specifying the sources of the shuffle operation to
be a zeroed register and the source operand, the source operand and zero, or both
to be the source operand. When multiplying by 2^m, a constant can be freely
added to the source operand by specifying the constant as the right operand to
the shuffle.
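
The three multipliers follow directly from how a shuffle pairs fields. The Python sketch below treats each interleaved pair as one field of width 2m, with the left operand in the upper half and the right operand in the lower half; that placement is an assumption inferred from the sentence about the right operand, and the function name is hypothetical.

```python
def shuffle_widen(left: list, right: list, m: int) -> list:
    """Pair up m-bit fields from two operands; each pair forms one 2m-bit
    field with `left` in the high half and `right` in the low half."""
    return [(hi << m) | lo for hi, lo in zip(left, right)]


m = 16
src = [3, 500, 65535, 0]
zero = [0, 0, 0, 0]

times_1 = shuffle_widen(zero, src, m)            # zeroed register, source
times_2m = shuffle_widen(src, zero, m)           # source, zero
times_2m_plus_1 = shuffle_widen(src, src, m)     # source in both positions
plus_constant = shuffle_widen(src, [7] * 4, m)   # x * 2**m + 7

print(times_1)          # [3, 500, 65535, 0]
print(times_2m)         # [196608, 32768000, 4294901760, 0]
print(times_2m_plus_1)  # [196611, 32768500, 4294967295, 0]
```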



Arithmetic Operations

The characteristics of the algorithms that affect the arithmetic operations most 
directly are low-precision arithmetic, and vectorizable operations. The fixed-point 
arithmetic operations provided are most of the functions provided in the standard 
integer unit, except for those that check conditions. These functions include add, 
subtract, bitwise boolean operations, shift, set on condition, and multiply, in forms 
that take packed sets of bit fields of a specified size as operands. The floating-point 
arithmetic operations provided are as complete as the scalar floating-point 
arithmetic set. The result is generally a packed set of bit fields of the same size as 
the operands, except that the fixed-point multiply function intrinsically doubles 
the precision of the bit field. 

Conditional operations are provided only in the sense that the set on condition 
operations can be used to construct bit masks that can select between alternate 
vector expressions, using the bitwise boolean operations. All instructions operate 
over the entire octlet or hexlet operands, and produce a hexlet result. The sizes of
the bit fields supported are always powers of two. 
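
Mask-based selection of this kind can be sketched in Python. The sketch assumes that a set-on-condition operation writes all ones into a field when the condition holds and all zeros otherwise; the helper names are hypothetical.

```python
FIELD_BITS = 16
ALL_ONES = (1 << FIELD_BITS) - 1


def set_less(a: list, b: list) -> list:
    """Per-field set on condition: all ones where a < b, else zero (assumed)."""
    return [ALL_ONES if x < y else 0 for x, y in zip(a, b)]


def select(mask: list, if_set: list, if_clear: list) -> list:
    """Choose between two vectors using only bitwise boolean operations."""
    return [(m & t) | (~m & ALL_ONES & f)
            for m, t, f in zip(mask, if_set, if_clear)]


a = [1, 9, 4, 7]
b = [5, 2, 4, 8]
mask = set_less(a, b)
minimum = select(mask, a, b)  # element-wise min without any branch
print(minimum)  # [1, 2, 4, 7]
```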

Galois Field Operations 

Terpsichore provides a general software solution to the most common operations 
required for Galois Field arithmetic. The instruction provided is a polynomial 
divide, with the polynomial specified as one register operand. The result of a 
specified number of division steps, expressed as a register pair, is the result of the 
instruction. This instruction can be used to perform CRC generation and
checking, Reed-Solomon code generation and checking, and spread-spectrum
encoding and decoding.
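
Polynomial division over GF(2) and its use for CRC generation and checking can be sketched in Python. This is a bit-at-a-time software model, not the register-pair form of the instruction; the function name and the 3-bit example polynomial are chosen only for illustration.

```python
def gf2_divmod(dividend: int, divisor: int) -> tuple:
    """Divide two polynomials over GF(2), each encoded as an integer whose
    bit i is the coefficient of x**i. Returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("polynomial divisor is zero")
    degree = divisor.bit_length() - 1
    quotient = 0
    while dividend.bit_length() - 1 >= degree:
        shift = dividend.bit_length() - 1 - degree
        quotient |= 1 << shift
        dividend ^= divisor << shift  # subtraction in GF(2) is XOR
    return quotient, dividend


message = 0b11010011101100
polynomial = 0b1011  # x**3 + x + 1

# Generation: append 3 zero bits and keep the remainder as the CRC.
_, crc = gf2_divmod(message << 3, polynomial)
# Checking: a codeword with the CRC appended divides evenly.
_, check = gf2_divmod((message << 3) | crc, polynomial)
print(bin(crc), check)  # 0b100 0
```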

Software Conventions 

The following section describes software conventions which are to be employed at
software module boundaries, in order to permit the combination of separately
compiled code and to provide standard interfaces between application, library
and system software. Register usage and procedure call conventions may be
modified, simplified or optimized when a single compilation encloses procedures
within a compilation unit so that the procedures have no external interfaces. For
example, internal procedures may permit a greater number of register-passed
parameters, or have registers allocated to avoid the need to save registers at
procedure boundaries, or may use a single stack or data pointer allocation to
suffice for more than one level of procedure call.

Register Usage

All Terpsichore registers are identical and general-purpose: there is no dedicated
zero-valued register, and no dedicated floating-point registers. However, some
procedure-call-oriented instructions imply usage of registers zero (0) and one (1)
in a manner consistent with the conventions described below. By software
convention, the non-specific general registers are used in more specific ways.

At a procedure call boundary, registers are saved either by the caller or callee
procedure, which provides a mechanism for leaf procedures to avoid needing to
save registers. Optimizers may choose to allocate variables into caller or callee
saved registers depending on how their lifetimes overlap with procedure calls.

The dp register points to a small (4 kilobyte) array of pointers, literals, and
statically-allocated variables, which is used locally to the procedure. The uses of
the dp register are similar to the MIPS use of the gp register, except that each
procedure may have a different value, which expands the space addressable by
small offsets from this pointer. This is an important distinction, as the offset field of
Terpsichore load and store instructions is only 12 bits. The compiler may use
additional registers and/or indirect pointers to address larger regions.

This mechanism also permits code to be shared, with each static instance of the
dp region assigned to a different address in memory. In conjunction with position-
independent or pc-relative branches, this allows library code to be dynamically
relocated and shared between processes.

Procedure Calling Conventions

Procedure parameters are normally allocated in registers, starting from register 2
up to register 9. These registers hold up to 8 parameters, which may each be of
any size from one byte to eight bytes, including single-precision and double- 
precision floating-point parameters. Quad-precision floating-point parameters 
require an aligned pair of registers. The C varargs.h or stdarg.h conventions may 
require saving registers into memory (this is not necessarily so, but some semi- 
portable semi-conventions such as _doprnt would break otherwise). Procedure 
return values are also allocated in registers, starting from register 2 up to register 

There are several data structures maintained in registers for the procedure calling
conventions: link, sp, dp, fp. The link register contains the address to which the
callee should return at the conclusion of the procedure.

The sp register is used to form addresses to save parameter and other registers, and
to maintain local variables, i.e., data that is allocated as a LIFO stack. For procedures
that require a stack, normally a single allocation is performed, which allocates 
space for input parameters, local variables, saved registers, and output 
parameters all at once. The sp register is always 16-byte aligned. 

The dp register is used to address pointers, literals and static variables for the 
procedure. The newpc register is loaded with the entry point of the procedure, 
and the newdp register is loaded with the value of the dp register required for the 
procedure. This mechanism provides for dynamic linking, by initially filling in the 
link and dp fields in the data structure to point to the dynamic linker. The linker
can use the current contents of the link and/or dp registers to determine the
identity of the caller and callee, to find the location to fill in the pointers and
resume execution.

The fp register is used to address the stack frame when the stack size varies 
during execution of a procedure, such as when using the GNU alloca function. 
When the stack size can be determined at compile time, the sp register is used to 
address the stack frame and the fp register may be used for other general 
purposes as a callee-saved register. 

Typical static-linked, intra-module calling sequence:

caller or callee (non-leaf):

    A.ADDI    sp,-size
    S.64      link,off(sp)
    ... (using same dp as caller)
    B.LINK.I  callee
    L.64      link,off(sp)
    A.ADDI    sp,size
    B         link

callee (leaf):

    ... (using dp)
    B         link

Typical dynamic-linked, inter-module calling sequence:

caller or callee (non-leaf):

    A.ADDI    sp,-size
    S.128     linkdp,off(sp)
    ... (using dp)
    L.128     linkdp,off(dp)
    B.LINK    link,link
    L.128     linkdp,off(sp)
    ... (using dp)
    A.ADDI    sp,size
    B         link

callee (leaf):

    ... (using dp)
    B         link

The load instruction is required in the caller following the procedure call to 
restore the dp register. A second load instruction also restores the link register, 
which may be located at any point between the last procedure call and the branch 
instruction which returns from the procedure.

System and Privileged Library Calls

It is an objective to make calls to system facilities and privileged libraries as similar
as possible to normal procedure calls as described above. Rather than invoke
system calls as an exception, which involves significant latency and complication,
we prefer to use a modified procedure call in which the process privilege level is
quietly raised to the required level. In order to provide this mechanism safely,
interaction with the virtual memory system is required.

Such a routine must not be entered from anywhere other than its legitimate entry
point, otherwise the sudden access to a higher privilege level might be taken
advantage of. The branch-gateway instruction ensures both that the procedure is
invoked at a proper entry point, and that the data pointer is properly set. To
ensure this, the branch-through-gateway instruction retrieves a "gateway"
directly from the protected virtual memory space. The gateway contains the
virtual address of the entry point of the procedure and the virtual address to be
loaded into the data pointer. A gateway can only exist in regions of the virtual
address space designated to contain them, to ensure that a gateway cannot be
forged.

Gateway with pointers to code and data spaces

Similarly, a return from a system or privileged routine involves a reduction of 
privilege. This need not be carefully controlled by architectural facilities, so a 
procedure may freely branch to a less-privileged code address. However, in 
certain, though perhaps rare, cases, it would be useful to have highly privileged 
code call less-privileged routines. As an example, a user may request that errors in
a privileged routine be reported by invoking a user-supplied error-logging routine.
In such a case, a return from a procedure actually requires an increase in 
privilege, which must be carefully controlled. Again, a branch-through-gateway 
instruction can be used, this time in the instruction following the call, to raise the 
privilege again in a controlled fashion. In such a case, special care must be taken 
to ensure that the less-privileged routine is not permitted to gain unauthorized 
access by corruption of the stack or saved registers, such as by saving all registers 
and setting up a new stack frame that may be manipulated by the less-privileged
routine. 

Typical dynamic-linked, inter-gateway calling sequence:

caller:

    S.128     linkdp,off(sp)
    B.GATE    linkdp,off(dp)
    L.128     linkdp,off(sp)

callee (non-leaf):

    S.64      sp,off(dp)
    L.64      sp,off(dp)
    S.128     linkdp,off(sp)
    ... (using dp)
    L.128     linkdp,off(sp)
    L.64      sp,off(dp)
    B.DOWN    link

callee (leaf):

    ... (using dp)
    B.DOWN    link



The callee, if it uses a stack for local variable allocation, cannot necessarily trust
the value of the sp passed to it, except as a region to receive parameters held in
memory.

Pipeline Organization

Euterpe performs all instructions as if executed one-by-one, in-order, with precise
exceptions always available. Consequently, code which ignores the subsequent
discussion of Euterpe pipeline implementations will still perform correctly.
However, the highest performance of the Euterpe processor is achieved only by
matching the ordering of instructions to the characteristics of the pipeline. In the
following discussion, the general characteristics of all Euterpe implementations
precede discussion of specific choices for specific implementations.

Classical Pipeline Structures

Pipelining in general refers to hardware structures that overlap various stages of
execution of a series of instructions so that the time required to perform the series
of instructions is less than the sum of the times required to perform each of the
instructions separately. Additionally, pipelines carry the connotation of a collection
of hardware structures which have a simple ordering and where each structure
performs a specialized function.

The diagram below shows the timing of what has become a canonical pipeline
structure for a simple RISC processor, with time on the horizontal axis increasing
to the right, and successive instructions on the vertical axis going downward. The
stages I, R, E, M, and W refer to units which perform instruction fetch, register
file fetch, execution, data memory fetch, and register file write. The stages are
aligned so that the result of the execution of an instruction may be used as the
source of the execution of an immediately following instruction, as seen by the fact
that the end of an E stage (bold in line 1) lines up with the beginning of the E stage
(bold in line 2) immediately below. Also, it can be seen that the result of a load
operation executing in stages E and M (bold in line 3) is not available in the
immediately following instruction (line 4), but may be used two cycles later (line 5);
this is the cause of the load delay slot seen on some RISC processors.



canonical pipeline



In the diagrams below we simplify the diagrams somewhat by eliminating the pipe
stages for instruction fetch, register file fetch, and register file write, which can be
understood to precede and follow the portions of the pipelines diagrammed. The
diagram above is shown again in this new format, showing that the canonical
pipeline has very little overlap of the actual execution of instructions.

A superscalar pipeline is one capable of simultaneously issuing two or more
instructions which are independent, in that they can be executed in either order
and separately, producing the same result as if they were executed serially. The
diagram below shows a two-way superscalar processor, where one instruction
may be a register-to-register operation (using stage E) and the other may be a
register-to-register operation (using stage A) or a memory load or store (using
stages A and M).

A superpipelined pipeline is one capable of issuing simple instructions frequently
enough that the result of a simple instruction must be independent of the
immediately following one or more instructions. The diagram below shows a two-
cycle superpipelined implementation:

superscalar pipeline



In the diagrams below, pipeline stages are labelled with the type of instruction
which may be performed by that stage. The position of the stage further identifies
the function of that stage, as for example a load operation may require several L
stages to complete the instruction.

Superstring Pipeline

Euterpe architecture provides for implementations designed to fetch and execute
several instructions in each clock cycle. For a particular ordering of instruction
types, one instruction of each type may be issued in a single clock cycle. The
ordering required is A, L, E, S, B; in other words, a register-to-register address
calculation, a memory load, a register-to-register data calculation, a memory store,
and a branch. Because of the organization of the pipeline, each of these
instructions may be serially dependent. Instructions of type E include the fixed-
point execute-phase instructions as well as floating-point and digital signal
processing instructions. We call this form of pipeline organization "superstring,"⁴
because of the ability to issue a string of dependent instructions in a single clock
cycle, as distinguished from superscalar or superpipelined organizations, which
can only issue sets of independent instructions.

⁴Readers with a background in theoretical physics may have seen this term in another,
unrelated, context.

These instructions take from one to four cycles of latency to execute, and a branch
prediction mechanism is used to keep the pipeline filled. The diagram below
shows a box for the interval between issue of each instruction and the completion.
Bold letters mark the critical latency paths of the instructions, that is, the periods
between the required availability of the source registers and the earliest
availability of the result registers. The A-L critical latency path is a special case, in
which the result of the A instruction may be used as the base register of the L
instruction without penalty. E instructions may require additional cycles of latency
for certain operations, such as fixed-point multiply and divide, floating-point and
digital signal processing operations.

Superspring Pipeline

Euterpe architecture provides an additional refinement to the organization
defined above, in which the time permitted by the pipeline to service load
operations may be flexibly extended. Thus, the front of the pipeline, in which A, L
and B type instructions are handled, is decoupled from the back of the pipeline, in
which E and S type instructions are handled. This decoupling occurs at the point
at which the data cache and its backing memory are referenced; similarly, a FIFO
that is filled by the instruction fetch unit decouples instruction cache references
from the front of the pipeline shown above. The depth of the FIFO structures is
implementation-dependent, i.e. not fixed by the architecture.

The diagram below indicates why we call this pipeline organization feature
superspring, an extension of our superstring organization.



Instruction fetch unit
Data fetch unit
Execution unit

Superspring pipeline



In the superspring organization, the latency of load instructions can be
extended, so execute instructions are deferred until the results of the load are
available. Nevertheless, the execution unit still processes instructions in normal
order, and provides precise exceptions.

Superspring pipeline



Superthread Pipeline

A difficulty of superpipelining is that dependent operations must be separated by
the latency of the pipeline, and for highly pipelined machines, the latency of
simple operations can be quite significant. The Euterpe "superthread" pipeline
provides for very highly pipelined implementations by alternating execution of two
or more independent threads. In this context, a thread is the state required to
maintain an independent execution; the architectural state required is that of the
register file contents, program counter, privilege level, local TLB, and when
required, exception status. The latter state, exception status, may be minimized
by ensuring that only one thread may handle an exception at one time. In order to
ensure that all threads make reasonable forward progress, several of the machine
resources must be scheduled fairly.

An example of a resource that must be fairly shared is the data
memory/cache subsystem. In MicroUnity's first Euterpe implementation, Euterpe
is able to perform a load operation only on every second cycle, and a store
operation only on every fourth cycle. Euterpe schedules these fixed timing
resources fairly by using a round-robin schedule of a number of threads which is
relatively prime to the resource reuse rates. For Euterpe's first implementation,
five simultaneous threads of execution ensure that resources which may be used
every two or four cycles are fairly shared by allowing the instructions which use
those resources to be issued only on every second or fourth issue slot for that
thread.

In the diagram below, the thread number which issues an instruction is indicated
on each clock cycle, and below it, a list of which functional units may be used by
that instruction. The diagram repeats every 20 cycles, so cycle 20 is similar to cycle
0, cycle 21 is similar to cycle 1, etc. This schedule ensures that no resource conflicts
occur between threads for these resources. Thread 0 may issue an E, L, S or B on
cycle 0, but on its next opportunity, cycle 5, may only issue E or B, and on cycle 10
may issue E, L or B, and on cycle 15, may issue E or B.

When seen from the perspective of an individual thread, the resource use
diagram looks similar to that of the collection. Thus an individual thread may use
the load unit every two instructions, and the store unit every four instructions.

A Euterpe Superthread pipeline, with 5 simultaneous threads of execution,
permits simple operations, such as register-to-register add (E.ADD), to take 5
cycles to complete, allowing for an extremely deeply pipelined implementation.
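
The round-robin schedule described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not MicroUnity's implementation: it assumes that thread (cycle mod 5) issues on each cycle, that the load unit is available on even cycles and the store unit on cycles divisible by four, and the names `issue_slot`, `schedule_period`, `THREADS`, `LOAD_PERIOD` and `STORE_PERIOD` are illustrative.

```python
from math import gcd

THREADS = 5        # simultaneous threads in the first implementation
LOAD_PERIOD = 2    # load unit usable every second cycle
STORE_PERIOD = 4   # store unit usable every fourth cycle


def issue_slot(cycle):
    """Return (thread, units) for one clock cycle of the assumed schedule."""
    thread = cycle % THREADS
    units = ["E"]                      # execute is always available
    if cycle % LOAD_PERIOD == 0:
        units.append("L")
    if cycle % STORE_PERIOD == 0:
        units.append("S")
    units.append("B")                  # branch is always available
    return thread, units


def schedule_period():
    """Number of cycles after which the schedule repeats (least common multiple)."""
    period = THREADS
    for reuse in (LOAD_PERIOD, STORE_PERIOD):
        period = period * reuse // gcd(period, reuse)
    return period
```

Because 5 is relatively prime to both 2 and 4, each thread sees the load unit on every second of its own issue slots and the store unit on every fourth, which is the fairness property the text relies on.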

Branch/fetch Prediction 

Euterpe does not have delayed branch instructions, and so relies upon branch or
fetch prediction to keep the pipeline full around unconditional and conditional
branch instructions. In the simplest form of branch prediction, as in Euterpe's
first implementation, a taken conditional backward (toward a lower address)
branch predicts that a future execution of the same branch will be taken. More
elaborate prediction may cache the source and target addresses of multiple
branches, both conditional and unconditional, and both forward and reverse.

The hardware prediction mechanism is tuned for optimizing conditional branches
that close loops or express frequent alternatives, and will generally require
substantially more cycles when executing conditional branches whose outcome is
not predominantly taken or not-taken. For such cases of unpredictable conditional
results, the use of code which avoids conditional branches in favor of the use of set
on compare and multiplex instructions may result in greater performance.
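
The branch-free style suggested above can be illustrated with a small Python model. It is a sketch only: it assumes that a set-on-compare instruction yields an all-ones mask when the comparison holds and zero otherwise, and that a multiplex instruction selects bits from one of two operands under a mask. The names `set_less`, `mux` and `branchless_min` are hypothetical and are not Euterpe mnemonics.

```python
MASK64 = (1 << 64) - 1


def set_less(a, b):
    """Model of an unsigned set-on-compare: all ones if a < b, else zero (assumed encoding)."""
    return MASK64 if a < b else 0


def mux(mask, x, y):
    """Model of a multiplex: take bits of x where mask is 1, bits of y where mask is 0."""
    return ((mask & x) | (~mask & y)) & MASK64


def branchless_min(a, b):
    """Minimum of two unsigned 64-bit values without a conditional branch in the modelled code."""
    return mux(set_less(a, b), a, b)
```

On the modelled machine the two-instruction sequence has a fixed cost regardless of how predictable the comparison is, which is the trade-off the paragraph describes.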

Where the above technique may not be applicable, a Euterpe pipeline may ensure
that conditional branches which have a small positive offset be handled as if the
branch is always predicted to be not taken, with the recovery of a misprediction
causing cancellation of the instructions which have already been issued but not
completed which would be skipped over by the taken conditional branch. This
"conditional-skip" optimization is performed by the Euterpe implementation and
requires no specific architectural feature to access or implement.

A Euterpe pipeline may also perform "branch-return" optimization, in which a
branch-and-link instruction saves a branch target address which is used to predict
the target of the next returning branch-register instruction. This optimization may be
implemented with a depth of one (only one return address kept), or as a stack of
finite depth, where a branch and link pushes onto the stack, and a branch-register
pops from the stack. This optimization can eliminate the misprediction cost of
simple procedure calls, as the calling branch is susceptible to hardware
prediction, and the returning branch is predictable by the branch-return
optimization. Like the conditional-skip optimization described above, this feature is
performed by the Euterpe implementation and requires no specific architectural
feature to access or implement.
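
The finite-depth stack form of the branch-return optimization can be sketched as follows. This is a minimal model, not the Euterpe hardware: the class name `ReturnPredictor`, its method names, and the policy of discarding the oldest entry when the stack is full are all assumptions made for illustration.

```python
class ReturnPredictor:
    """Finite-depth return-address stack; depth=1 keeps only the latest return address."""

    def __init__(self, depth=1):
        self.depth = depth
        self.stack = []

    def on_branch_and_link(self, return_address):
        """A branch-and-link pushes the address it will return to."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # assumed overflow policy: drop the oldest entry
        self.stack.append(return_address)

    def predict_branch_register(self):
        """A branch-register pops the predicted target, or None if nothing is saved."""
        return self.stack.pop() if self.stack else None
```

With a depth of one the predictor covers simple leaf calls; deeper stacks extend the same prediction to nested calls.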

Additional Load and Execute Resources

Studies of the dynamic distribution of Euterpe instructions on various benchmark 
suites indicate that the most frequently-issued instruction classes are load 
instructions and execute instructions. In a high-performance Euterpe 
implementation, it is advantageous to consider execution pipelines in which the 
ability to target the machine resources toward issuing load and execute 
instructions is increased.

One of the means to increase the ability to issue execute-class instructions is to
provide the means to issue two execute instructions in a single-issue string. The
execution unit actually requires several distinct resources, so by partitioning these
resources, the issue capability can be increased without increasing the number of
functional units, other than the increased register file read and write ports. The
partitioning favored for the initial implementation places all instructions that
involve shifting and shuffling in one execution unit, and all instructions that
involve multiplication, including fixed-point and floating-point multiply and add, in
another unit. Resources used for implementing add, subtract, and bitwise logical
operations may be duplicated, being modest in size compared to the shift and
multiply units, or shared between the two units, as the operations have low-enough
latency that two operations might be pipelined within a single issue cycle.
These instructions must generally be independent, except perhaps that two
simple add, subtract, or bitwise logical instructions may be performed dependently, if the
resources for executing simple instructions are shared between the execution
units.

One of the means to increase the ability to issue load-class instructions is to
provide the means to issue two load instructions in a single-issue string. This
would generally increase the resources required of the data fetch unit and the
data cache, but a compensating solution is to steal the resources for the store
instruction to execute the second load instruction. Thus, a single-issue string can
then contain either two load instructions, or one load instruction and one store
instruction, which uses the same register read ports and address computation
resources as the basic 5-instruction string. This capability also may be employed to
provide support for unaligned load and store instructions, where a single-issue
string may contain as an alternative a single unaligned load or store instruction
which uses the resources of the two load-class units in concert to accomplish the
unaligned memory operation.

Result Forwarding 

When temporally adjacent instructions are executed by separate resources, the
results of the first instruction must generally be forwarded directly to the resource
used to execute the second instruction, where the result replaces a value which
may have been fetched from a register file. Such forwarding paths use significant
resources. A Terpsichore implementation must generally provide forwarding
resources so that dependencies from earlier instructions within a string are
immediately forwarded to later instructions, except between a first and second
execution instruction as described above. In addition, when forwarding results
from the execution units back to the data fetch unit, additional delay may be
incurred.

Instruction Set

All instructions are 32 bits in size, and use the high-order 8 bits to specify a major
operation code.

31     24 23                     0
| major  |         other          |
    8                24

The major field is filled with a value specified by the following table:

major operation code field values

Blank table entries cause the Reserved Instruction exception to occur.

For the major operation field values A.MINOR, L.MINOR, E.MINOR, F.16, F.32,
F.64, F.128, GF.16, GF.32, GF.64, GF.128, G.1, G.2, G.4, G.8, G.16, G.32, G.64, S.MINOR
and B.MINOR, the lowest-order six bits in the instruction specify a minor
operation code:

31     24 23              6 5      0
| major  |      other      | minor  |
    8            18            6

The minor field is filled with a value from one of the following tables:



£ MINOR 


0 


3 


i i5 


1 2i 


32 


^0 . 46 




\j 


cADCO 


ESUBC 


} EANCn 


1 


EA OD 


ES'Jb ESnuiC 


t cSnRi 




EAOOUO 


E5UBUG 


1 EXOR 


! 


E5HL0 


' eSHLUC j 




2 


EScTL 


E5UBL 


EOR 






i t cShliuG 


1 EUSHRl 


3 


ESEXGc. 


: E3UBGS 


i EAND 




ELMS 


i tULi^S 1 


I 


£ 


E5ETE 


ESUBE 


cCRN 


I 


=ASUM 


1 E5ELECT3 1 £5huFFl=i 


j EPORTi 




E5ETNE 


E5UBNE 


EXNOR 


I 


EPOTt 


E5HL 1 


1 


6 


E5ETUL 


ESUBUt 


ENOR 




E5HR 


1 EUShR 1 ESHLi 


EMSHPI 


7 


cSETUGE 


E5UBUGC 


ENAMD 


1 


£=iCTR 


1 EMSHR ( 


t 




minor operation code field values for E.MINOR 


F.sizs 


•J 






2-: 


52 






0 


FAOD ,N4 


. FADO T 


rACO r 


FAGD Z 


FACD 


! r^oO.X , F5ETE 




\ 


FSua u 


FSU8 T 


FSU8F 


F5UB Z 


F5UB 


! F5U8.X i r5ETNUc 


FSETML'E X 


2 


FMUL.N 


FMUL T 


FMUL.F 


FMUL.C 


FMUL 


1 FMUL.X j FSETNUGc 


F5ETNGE X 


3 


FDIV N 


FDIVT 


FDtVF 


FDIVC 


FDIV 


1 FD»VX i F5ETNUL 


FScTNUL < 


d 


? UNARY N 


• FLTNARYT 


F UNARY F 


F UNARY - 


F UNARY 


1 F UNAhY X . 






- ' • • I , ■- 


c 


'■ i » ' 


T 


I • It: 



minor operation code field values for F.size



GF Size 


•J 


3 


".6 


2^ 32 


*:0 


Ac 


56 


0 


GFADD.N 


GFADD : 


GFADD. F 


GrADu Z GrADD 


GrAOC.X 


GFSETc 




1 


GFSUB N 


GFSUB.T 


GF3UB F 


GF5UBC GFSUB 


GFSUB X 


GFSETNUE 


G=5ETMUE X 


2 


GFMUL N 


: GFMUL. T 


GFMUL.F 


GFMUL C GFMUL 


GFMUL.X 


GFSETNUGEiGFSETMGE X 


3 


GFDIV N 


■ GFDIV T 


GFDIV F 


GFDIV C GFDIV 


GFDIV X 


GF5ETMUL ! GF5ETNUL X 


4 


GF UNARY N 


■ GF UNARY T 


GF.UNAHY F . GF UNARY Z G? UNARY 


GF.JNARY.X 


I 


5 


_. ' » 1 




6 


i I 1 • 




1 


7 


1 1 




1 




minor operation code field values for GF.size 


o size 


0 


° 


16 


24 32 


4C 


46 


bi 


0 




GWUL 


GANDU 


. GADD 


GSuo 


GEXPAMO 

1 


G5nn 
f 


1 




GUMUL 


GXOR 


: 3CCMPRE53 jGlCOMPSESS 


2 


G3ETL 


GOtV 


GOR 


1 


GU£XAPA.N2 


GUSHR 

1 


3 


GSETGE 


GUDIV 


GAND 


i 1 


.1 


4 


GSETE 


GSUB 


GORN 


■ GEXPArjD 


GUEXPANO 


3C0MP=E55; ! GROTR 


5 


GSETNE 1 


GXNOR 


. GROTL 


GSHL 




1 


6 


GSETUL 1 


GNOR 


GSHR 


GUSHR 


GSHL 


GMSHR 


7 


G5ETUGE ! 


GNAND 


1 GROTH 


GMSHR 


.1 


1 




minor operation code field values for G.size


L. MINOR 


0 




16 


24 . 52 




46 


DO 


0 


LU16LA 


L16LA 


L64LA 


Lo i 




: 




LU16BA 


L15BA 


L54BA 


LUS 1 




! 


2 


LU16L 


Lt6L 


L64L 


1 






3 


LU1EB 


L^oB 


L5dB 


1 


1 — 


— \ — 




LU32LA 


L32LA 


L128LA 


\ 






5 


LU32BA 


L328A 


L128BA 


1 




. 


6 


LU32L 


L32L 


L128L 


1 




1 


7 


LU32B 


L32B 


L123B 


i 




1 



minor operation code field values for L.MINOR
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For the major operation field values F 16 F 32 F64 F tP« ^.nrU r^;^^ 



31     24 23          12 11     6 5      0
| major  |    other     | unary  | minor  |
    8          12           6        6

The unary field is filled with a value from one of the following tables:



24 



I 32 



40 



=-SQH 



F.FL0AT 



F.OEFLATE 



unary operation code field values for F.UNARY.size.r



24 



32 



^0 



48 



GF NEG 



GF SINK \ 



GF FtOAT 



GF INFLATE ' 



GF. DEFLATE- 



+ 



unary operation code field values for GF.UNARY.size.r 

The general forms of the instructions coded by a major operation code are one of
the following:

31     24 23                           0
| major  |            offset            |
    8                   24

31     24 23  18 17                    0
| major  |  ra  |        offset         |
    8       6              18

31     24 23  18 17  12 11             0
| major  |  ra  |  rb  |     offset     |
    8       6      6           12

31     24 23  18 17  12 11   6 5       0
| major  |  ra  |  rb  |  rc  |   rd    |
    8       6      6      6       6

The general forms of the instructions coded by major and minor operation codes
are one of the following:

31     24 23  18 17  12 11   6 5       0
| major  |  ra  |  rb  |  rc  |  minor  |
    8       6      6      6       6

31     24 23  18 17  12 11   6 5       0
| major  |  ra  | simm |  rc  |  minor  |
    8       6      6      6       6

The general form of the instructions coded by major, minor, and unary operation
codes is the following:

31     24 23  18 17  12 11   6 5       0
| major  |  ra  |  rb  | unary |  minor |
    8       6      6      6       6
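
The field layouts above amount to fixed shifts and masks on a 32-bit instruction word. The following Python sketch shows that extraction; the function name `decode_fields` and the dictionary keys are illustrative, and the sketch deliberately says nothing about what any particular opcode value means.

```python
def decode_fields(inst):
    """Split a 32-bit instruction word into the generic fields shown in the layouts."""
    inst &= 0xFFFFFFFF
    return {
        "major": (inst >> 24) & 0xFF,   # bits 31..24
        "ra": (inst >> 18) & 0x3F,      # bits 23..18
        "rb": (inst >> 12) & 0x3F,      # bits 17..12 (also read as simm)
        "rc": (inst >> 6) & 0x3F,       # bits 11..6
        "rd": inst & 0x3F,              # bits 5..0 (also read as minor)
        "offset12": inst & 0xFFF,       # bits 11..0
        "offset24": inst & 0xFFFFFF,    # bits 23..0
    }
```

Which of the overlapping fields is meaningful depends on the major (and, where present, minor) operation code, so a decoder would extract all of them and let the dispatch select.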

Definition 

def Thread as
    forever do
        inst <- LoadMemory(ProgramCounter, 32, L)
        ProgramCounter <- ProgramCounter + 4
        Instruction(inst)
    endforever
enddef

def Instruction(inst) as
    major <- inst31..24
    ra <- inst23..18
    simm <- rb <- inst17..12
    rc <- inst11..6




    minor <- rd <- inst5..0
case major of 
E.RES: 

EternallyReserved 
E.MINOR: 

case minor of 

E.ADD^E,ADD.O. E.ADD.UO. E.AND. E.ANDN. E.NAND E NOR 
E.OR E.ORN. E.SUB. E.SUB.O. E.SUB.UO. E.XNOR EXOR 

°- E SHL.UO. E SHR. E.USHR. E.ROTL. E ROTR 
i-^ii-^c^.^^'^'- ^ E UDIV. E LMS. E.ULI^S. E ASUM 
E.SET.E. E.SET.NE. E.SET.L E.SET.GE. E SET UL E SET UGE 
E.SUB.E. E.SUB.fgE. E.SUB.L E.SUB.GE. E.SUBA;l 1 SUB UGE 
Execute(minor,ra.rD.rc) 

E. SHL.I. E.SHLI.O. E.SHL.I.UO. E.SHR.I. E.USHR I E ROTR |- 

ExecuteShortlmmediat9(minor.ra.simm rc) 

others: 

raise Reservedlnstruction 

endcase 

E.ADD.I. E ADD.I.SO. E.AND.I. E.OR. I. E.NAND I E NOR I EXOR I 
E.SET.I E. E.SET.I.NE. E.SETJ.L. ES.ET.I.GE. ESET IUL ESE^^^^^^^^ 
E.SUB.I.E. E.SUB.1.NE. E.SUB.LL E.SUB.LGE E SUB^^^^^^^^ 

execute lmmediate(major .ra.rc .inst * i n) 
EMUX: 

Execute Ternary(maior.ra.rb rc rd^ 

E. COPY.I i • ^ 

ExecuteCopylmmediate{majcrrajnst^7 q) 
FMULADD15. FMULADD32. FMULADD64 FMULADD128 
FMULSUB16. FMULSUB32. FMULSUB64; FMULSUB128' 

FloatingPointTernary{major.ra.rb.rc rd) 

        F.16, F.32, F.64, F.128:
            case minor of
                F.ADD.N, F.SUB.N, F.MUL.N, F.DIV.N,
                F.ADD.T, F.SUB.T, F.MUL.T, F.DIV.T,
                F.ADD.F, F.SUB.F, F.MUL.F, F.DIV.F,
                F.ADD.C, F.SUB.C, F.MUL.C, F.DIV.C,
                F.ADD, F.SUB, F.MUL, F.DIV,
                F.ADD.X, F.SUB.X, F.MUL.X, F.DIV.X,
                F.SET.E, F.SET.NE, F.SET.UE, F.SET.NUE,
                F.SET.NUGE, F.SET.UGE, F.SET.UL, F.SET.NUL,
                F.SET.E.X, F.SET.NE.X, F.SET.UE.X, F.SET.NUE.X,
                F.SET.L.X, F.SET.NL.X, F.SET.NGE.X, F.SET.GE.X:
                    FloatingPoint(minor.op, major.size, minor.round, ra, rb, rc)
                F.UNARY.N, F.UNARY.T, F.UNARY.F, F.UNARY.C,
                F.UNARY, F.UNARY.X:
                    case unary of
                        F.ABS, F.NEG, F.SQR,
                        F.HALF, F.SINGLE, F.DOUBLE, F.QUAD,
                        F.INT, F.FLOAT:
                            FloatingPointUnary(unary.op, major.size, minor.round,
                                ra, rc)
                        others:
                            raise ReservedInstruction
                    endcase
                others:
                    raise ReservedInstruction
            endcase

GMULADD1. GMULADD2. GMULADD4 
GMULADD8. GMULADD16. GMULADD32. 
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GUMULADD2. GUMULADD4. 

GUMULADD8. GUMULADD16. GUMULADD32 

GMUX. GMUXGATHER. GSCATTERMUX. G.EXTRACT.128: 

GroupTernary(major.size.ra.rb.rc.rd) 
G.EXTRACT.I. G. EXTRACT J.64: 

GroupExtract!mmediate( major. ra.rb.rc.minor) 
G.I. G.2. G.4. G.8. G.16. G 32: 
case minor of 

G.SHL G.SHR. G.USHR. G ADD. G.SUB. G.MUL G UMUL 
G.AND. G.OR. G XOR. G.ANDN. CNAND. G.NOR G XNOR G ORN 
G.SET.E. G.SET.NE. G.SET.L G.SET.GE. G.SET.UL. G.SET UGE 
G.COPY. G.SWAP. G.DEAL G.SHUFFLE. G. COMPRESS, G EXPAND 
G.GATHER. G.SCATTER: 

"Group(minor.major.ra.rb.rc) 
G.COMPRESS I. G.EXPANDJ. G.SHLI. G.SHR. 1. G.U.SHR.I: 

GroupShQrtlmmediat3(minor. major, ra simm rc) 
G.EXTRACT.I: 

GroupExtracitmmedi3t£(ma)or.ra.rb.rc.minor) 
others: 

raise Reservedinstruciicn 

endcase 

        GFMULADD16, GFMULADD32, GFMULADD64,
        GFMULSUB16, GFMULSUB32, GFMULSUB64:
            GroupFloatingPointTernary(major, ra, rb, rc, rd)
        GF.16, GF.32, GF.64, GF.128:
            case minor of
                GF.ADD.N, GF.SUB.N, GF.MUL.N, GF.DIV.N,
                GF.ADD.T, GF.SUB.T, GF.MUL.T, GF.DIV.T,
                GF.ADD.F, GF.SUB.F, GF.MUL.F, GF.DIV.F,
                GF.ADD.C, GF.SUB.C, GF.MUL.C, GF.DIV.C,
                GF.ADD, GF.SUB, GF.MUL, GF.DIV,
                GF.ADD.X, GF.SUB.X, GF.MUL.X, GF.DIV.X,
                GF.SET.E, GF.SET.NE, GF.SET.UE, GF.SET.NUE,
                GF.SET.NUGE, GF.SET.UGE, GF.SET.UL, GF.SET.NUL,
                GF.SET.E.X, GF.SET.NE.X, GF.SET.UE.X, GF.SET.NUE.X,
                GF.SET.L.X, GF.SET.NL.X, GF.SET.NGE.X, GF.SET.GE.X:
                    GroupFloatingPoint(minor.op, major.size, minor.round, ra, rb, rc)
                GF.UNARY.N, GF.UNARY.T, GF.UNARY.F, GF.UNARY.C,
                GF.UNARY, GF.UNARY.X:
                    case unary of
                        GF.ABS, GF.NEG, GF.SQR,
                        GF.HALF, GF.SINGLE, GF.DOUBLE, GF.QUAD,
                        GF.INT, GF.FLOAT:
                            GroupFloatingPointUnary(unary.op, major.size,
                                minor.round, ra, rc)
                        others:
                            raise ReservedInstruction
                    endcase
                others:
                    raise ReservedInstruction
            endcase
        L.MINOR:
            case minor of
                L16L, LU16L, L32L, LU32L, L64L, L128L, L8, LU8,
                L16LA, LU16LA, L32LA, LU32LA, L64LA, L128LA,
                L16B, LU16B, L32B, LU32B, L64B, L128B,
                L16BA, LU16BA, L32BA, LU32BA, L64BA, L128BA:
                    Load(minor, ra, rb, rc)
                others:
                    raise ReservedInstruction
            endcase
        L16LI, LU16LI, L32LI, LU32LI, L64LI, L128LI, L8I, LU8I,
        L16LAI, LU16LAI, L32LAI, LU32LAI, L64LAI, L128LAI,
        L16BI, LU16BI, L32BI, LU32BI, L64BI, L128BI,
        L16BAI, LU16BAI, L32BAI, LU32BAI, L64BAI, L128BAI:
            LoadImmediate(major, ra, rb, inst11..0)
        S.MINOR:
            case minor of
                S16L, S32L, S64L, S128L, S8,
                S16LA, S32LA, S64LA, S128LA,
                SAAS64LA, SCAS64LA, SMAS64LA, SM64LA,
                S16B, S32B, S64B, S128B,
                S16BA, S32BA, S64BA, S128BA,
                SAAS64BA, SCAS64BA, SMAS64BA, SM64BA:
                    Store(minor, ra, rb, rc)
                others:
                    raise ReservedInstruction
            endcase
        S16LI, S32LI, S64LI, S128LI, S8I,
        S16LAI, S32LAI, S64LAI, S128LAI,
        SAAS64LAI, SCAS64LAI, SMAS64LAI, SM64LAI,
        S16BI, S32BI, S64BI, S128BI,
        S16BAI, S32BAI, S64BAI, S128BAI,
        SAAS64BAI, SCAS64BAI, SMAS64BAI, SM64BAI:
            StoreImmediate(major, ra, rb, inst11..0)
        B.MINOR:
            case minor of
                B, B.DOWN:
                    Branch(minor, ra)
                B.LINK:
                    BranchAndLink(minor, ra, rb)
                B.GATE:
                    BranchGateway(minor, ra, rb, rc)
                others:
                    raise ReservedInstruction
            endcase
        BLINKI, BI:
            BranchImmediate(major, inst23..0)
        BFE16, BFNUE16, BFNUGE16, BFNUL16,
        BFE32, BFNUE32, BFNUGE32, BFNUL32,
        BFE64, BFNUE64, BFNUGE64, BFNUL64,
        BFE128, BFNUE128, BFNUGE128, BFNUL128,
        BE, BNE, BL, BGE, BUL, BUGE, BANDE, BANDNE:
            BranchConditional(major, inst11..0)
        BGATEI:
            BranchGatewayImmediate(ra, rb, inst11..0)
        others:
            raise ReservedInstruction
    endcase
enddef

Eternally Reserved

This operation generates a reserved instruction exception.

Operation code

| E.RES | Eternally reserved |

Format

E.RES imm

31     24 23                     0
| E.RES  |          imm           |
    8                24

Description

The reserved instruction exception is raised. Software may depend upon this
major operation code raising the reserved instruction exception in all Terpsichore
processors. The choice of operation code intentionally ensures that a branch to a
zeroed memory area will raise an exception.

Definition

def EternallyReserved as
    raise ReservedInstruction
enddef

Exceptions

Reserved Instruction

Branch

Operation codes

| B      | Branch                   |
| B.DOWN | Branch down in privilege |

Format

op ra

31      24 23  18 17  12 11   6 5    0
| B.MINOR |  ra  |  0   |  0   |  op  |
     8       6      6      6      6

Description

Execution branches to the address specified by the contents of register ra. If
specified, the current privilege level is reduced to the level specified by the
low-order two bits of the contents of register ra.

Access disallowed exception occurs if the contents of register ra is not aligned on a
quadlet boundary, unless the operation specifies the use of the low-order two bits
of the contents of register ra as a privilege level.

Definition

def Branch(op, ra) as
    a <- RegRead(ra, 64)
    if op = B.DOWN then
        if PrivilegeLevel > a1..0 then
            PrivilegeLevel <- a1..0
        endif
    else
        if (a and 3) ≠ 0 then
            raise AccessDisallowedByVirtualAddress
        endif
    endif
    ProgramCounter <- a63..2 || 0^2
enddef

Exceptions

Access disallowed by virtual address
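
The Branch definition can be paraphrased in Python to make the privilege rule concrete: B.DOWN may only lower the privilege level, and a plain branch requires a quadlet-aligned target. This is a sketch under assumptions; the function name `branch`, the string opcodes, and the returned tuple are illustrative and not part of the architecture.

```python
MASK64 = (1 << 64) - 1


class AccessDisallowedByVirtualAddress(Exception):
    """Stand-in for the exception named in the definition."""


def branch(op, a, privilege_level):
    """Return (new_program_counter, new_privilege_level) for B or B.DOWN.

    `a` is the 64-bit contents of register ra.
    """
    if op == "B.DOWN":
        requested = a & 3                      # low-order two bits carry the level
        if privilege_level > requested:
            privilege_level = requested        # privilege is only ever reduced
    elif a & 3:
        raise AccessDisallowedByVirtualAddress(hex(a))
    return (a & ~3) & MASK64, privilege_level
```

Note that a B.DOWN requesting a level above the current one leaves the level unchanged rather than raising it.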

Branch and Link

This operation branches to a location specified by a register, saving the value of
the program counter into a register.

Operation codes

| B.LINK | Branch and link |

Format

op rb,ra

31      24 23  18 17  12 11   6 5      0
| B.MINOR |  ra  |  rb  |  0   | B.LINK |
     8       6      6      6       6

Description

The address of the instruction following this one is placed into register rb.
Execution branches to the address specified by the contents of register ra.

Access disallowed exception occurs if the contents of register ra is not aligned on a
quadlet boundary.

Definition

def BranchAndLink(op, ra, rb) as
    a <- RegRead(ra, 64)
    if (a and 3) ≠ 0 then
        raise AccessDisallowedByVirtualAddress
    endif
    RegWrite(rb, 64, ProgramCounter + 4)
    ProgramCounter <- a63..2 || 0^2
enddef

Exceptions

Access disallowed by virtual address

Branch Conditionally

These operations compare two operands, and depending on the result of that
comparison, conditionally branch to a nearby code location.

Operation codes

| B.AND.E      | Branch and equal to zero                                    |
| B.AND.NE     | Branch and not equal to zero                                |
| B.E⁶         | Branch equal                                                |
| B.F.E.16     | Branch floating-point equal half                            |
| B.F.E.32     | Branch floating-point equal single                          |
| B.F.E.64     | Branch floating-point equal double                          |
| B.F.E.128    | Branch floating-point equal quad                            |
| B.F.NUE.16   | Branch floating-point not unordered or equal half           |
| B.F.NUE.32   | Branch floating-point not unordered or equal single         |
| B.F.NUE.64   | Branch floating-point not unordered or equal double         |
| B.F.NUE.128  | Branch floating-point not unordered or equal quad           |
| B.F.NUGE.16  | Branch floating-point not unordered greater or equal half   |
| B.F.NUGE.32  | Branch floating-point not unordered greater or equal single |
| B.F.NUGE.64  | Branch floating-point not unordered greater or equal double |
| B.F.NUGE.128 | Branch floating-point not unordered greater or equal quad   |
| B.F.NUL.16   | Branch floating-point not unordered or less half            |
| B.F.NUL.32   | Branch floating-point not unordered or less single          |
| B.F.NUL.64   | Branch floating-point not unordered or less double          |
| B.F.NUL.128  | Branch floating-point not unordered or less quad            |
| B.G.Z⁷       | Branch signed greater than zero                             |
| B.GE         | Branch signed greater or equal                              |
| B.GE.Z⁸      | Branch signed greater or equal to zero                      |
| B.L          | Branch signed less                                          |
| B.L.Z⁹       | Branch signed less than zero                                |
| B.LE.Z¹⁰     | Branch signed less or equal to zero                         |
| B.NE¹¹       | Branch not equal                                            |
| B.U.GE       | Branch unsigned greater or equal                            |
| B.U.L        | Branch unsigned less                                        |

⁶ B.E suffices for both signed and unsigned comparison for equality.
⁷ B.G.Z is encoded as B.U.L with both instruction fields ra and rb equal.
⁸ B.GE.Z is encoded as B.GE with both instruction fields ra and rb equal.
⁹ B.L.Z is encoded as B.L with both instruction fields ra and rb equal.
¹⁰ B.LE.Z is encoded as B.U.GE with both instruction fields ra and rb equal.
¹¹ B.NE suffices for both signed and unsigned comparison for inequality.

| number format           | type | compare          | size         |
| signed integer          |      | E NE L GE        |              |
| unsigned integer        | U    | E¹² NE¹³ L GE    |              |
| bitwise and             | AND  | E NE             |              |
| signed integer vs. zero | Z    | L GE G LE        |              |
| floating-point          | F    | E NUE NUGE NUL   | 16 32 64 128 |

Format

op rb,ra,target

31     24 23  18 17  12 11            0
|   op   |  ra  |  rb  |    offset     |
    8       6      6          12

Description

The contents of registers or register pairs specified by ra and rb are compared, as
specified by the op field. If the result of the comparison is true, execution
branches to the address specified by the offset field. Otherwise, execution
continues at the next sequential instruction.

A reserved instruction exception occurs when the size specified by the op field is
128 if ra0 or rb0 is set.

Definition 

def BranchConditional(op, ra, rb, offset) as
    case op of
        BFE16, BFNUE16, BFNUGE16, BFNUL16,
        BFE32, BFNUE32, BFNUGE32, BFNUL32,
        BFE64, BFNUE64, BFNUGE64, BFNUL64,
        BFE128, BFNUE128, BFNUGE128, BFNUL128:
            type <- F
        BE, BNE:
            type <- NONE
        BL, BGE:
            type <- (ra = rb) ? Z : NONE
        BUL, BUGE:
            type <- (ra = rb) ? Z : U
        BANDE, BANDNE:
            type <- AND
    endcase
    case op of
        B.U.L:
            compare <- (ra = rb) ? G : L
        B.U.GE:



¹² B.U.E implemented as B.E.
¹³ B.U.NE implemented as B.NE.
            compare <- (ra = rb) ? LE : GE
        B.GE:
            compare <- GE
        B.L:
            compare <- L
        B.AND.NE, B.NE:
            compare <- NE
        B.AND.E, B.E, B.F.E.16, B.F.E.32, B.F.E.64, B.F.E.128:
            compare <- E
        B.F.NUE.16, B.F.NUE.32, B.F.NUE.64, B.F.NUE.128:
            compare <- NUE
        B.F.NUGE.16, B.F.NUGE.32, B.F.NUGE.64, B.F.NUGE.128:
            compare <- NUGE
        B.F.NUL.16, B.F.NUL.32, B.F.NUL.64, B.F.NUL.128:
            compare <- NUL
    endcase
    case op of
        BFE16, BFNUE16, BFNUGE16, BFNUL16:
            size <- 16
        BFE32, BFNUE32, BFNUGE32, BFNUL32:
            size <- 32
        BFE64, BFNUE64, BFNUGE64, BFNUL64:
            size <- 64
        BFE128, BFNUE128, BFNUGE128, BFNUL128:
            size <- 128
        BE, BNE, BL, BGE, BUL, BUGE, BANDE, BANDNE:
            size <- undefined
    endcase
    case type of
        NONE:
            a <- RegRead(ra, 64)
            b <- RegRead(rb, 64)
            l <- b
            r <- a
        U:
            a <- RegRead(ra, 64)
            b <- RegRead(rb, 64)
            l <- 0 || b
            r <- 0 || a
        AND:
            a <- RegRead(ra, 64)
            b <- RegRead(rb, 64)
            l <- a and b
            r <- 0
        Z:
            a <- RegRead(ra, 64)
            l <- a
            r <- 0
        F:
            a <- RegRead(ra, (size<64) ? 64 : size)
            b <- RegRead(rb, (size<64) ? 64 : size)
            l <- F(size, b)
            r <- F(size, a)
    endcase
    if (type = F) and (isNaN(r) or isNaN(l)) then
        c <- false
    else
        case compare of
            E:
                c <- l = r
            NE, NUE:
                c <- l ≠ r
            L, NUGE:
                c <- l < r
            NUL, GE:
                c <- l ≥ r
            G:
                c <- l > r
            LE:
                c <- l ≤ r
        endcase
    endif
    if c then
        PC <- PC + (offset11^50 || offset || 0^2)
    endif
enddef

Exceptions 
Reserved Instruction 
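
The integer cases of the definition above, including the convention that equal ra and rb fields re-purpose four opcodes as comparisons against zero, can be sketched in Python. This is an illustrative model only: floating-point comparisons are omitted, the names `signed64`, `branch_taken` and `branch_target` are hypothetical, and `pc` stands for whatever value the definition's PC holds when the branch executes. Following the definition, the left operand is taken from rb and the right operand from ra.

```python
MASK64 = (1 << 64) - 1


def signed64(x):
    """Interpret a 64-bit register value as a two's-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def branch_taken(op, ra, rb, regs):
    """Evaluate the integer branch conditions; `regs` is a list of 64-bit values."""
    a, b = regs[ra] & MASK64, regs[rb] & MASK64
    same = ra == rb                      # equal fields select the compare-with-zero forms
    if op == "B.E":
        return b == a
    if op == "B.NE":
        return b != a
    if op == "B.AND.E":
        return (a & b) == 0
    if op == "B.AND.NE":
        return (a & b) != 0
    if op == "B.L":                      # ra == rb encodes B.L.Z
        return signed64(a) < 0 if same else signed64(b) < signed64(a)
    if op == "B.GE":                     # ra == rb encodes B.GE.Z
        return signed64(a) >= 0 if same else signed64(b) >= signed64(a)
    if op == "B.U.L":                    # ra == rb encodes B.G.Z
        return signed64(a) > 0 if same else b < a
    if op == "B.U.GE":                   # ra == rb encodes B.LE.Z
        return signed64(a) <= 0 if same else b >= a
    raise ValueError(op)


def branch_target(pc, offset12):
    """Sign-extend the 12-bit offset, scale by 4, and add it to pc (mod 2**64)."""
    offset12 &= 0xFFF
    if offset12 & 0x800:
        offset12 -= 0x1000
    return (pc + 4 * offset12) & MASK64
```

The zero-comparison encodings cost nothing in opcode space because comparing a register with itself would otherwise give a constant result.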

Branch Gateway Immediate

This operation provides a secure means to call a procedure, including those at a
higher privilege level.

Operation codes

| B.GATE.I | Branch gateway immediate |

Format

B.GATE.I ra,imm

31       24 23  18 17  12 11           0
| B.GATE.I |  ra  |  0   |     imm      |
     8        6      6          12

Description

A virtual address is computed from the sum of the contents of register ra and the
sign-extended value of the 12-bit immediate field. The contents of 16 bytes of
memory using the little-endian byte order is fetched. A branch and link occurs to
the low-order octlet of the memory data, and the successor to the current program
counter, catenated with the current execution privilege, is placed in register 0. The
privilege level is set to the contents of the low-order two bits of the memory data.
Register 1 is loaded with the high-order octlet of the memory data.

An access disallowed exception occurs if the new privilege level is greater than the
privilege level required to write the memory data, or if the old privilege level is
lower than the privilege required to access the memory data as a gateway.

An access disallowed exception occurs if the target virtual address is a higher
privilege than the current level and gateway access is not set for the gateway
virtual address, or if the access is not aligned on a 16-byte boundary.

A reserved instruction exception occurs if the rb field is non-zero.

Definition

def BranchGatewayImmediate(ra, rb, imm) as
    a <- RegRead(ra, 64)
    VirtAddr <- a + (imm11^52 || imm)
    if VirtAddr3..0 ≠ 0 then
        raise AccessDisallowedByVirtualAddress
    endif
    if rb ≠ 0 then
        raise ReservedInstruction
    endif
    b <- LoadMemory(VirtAddr, 128, L)
    bx <- b127..64 || ProgramCounter63..2+1 || PrivilegeLevel
    ProgramCounter <- b63..2 || 0^2
    PrivilegeLevel <- b1..0
    RegWrite(rb, 128, bx)
enddef

Exceptions 

Reserved Instruction 

Access disallowed by virtual address 

Access disallowed by tag 

Access disallowed by global TLB 

Access disallowed by local TLB 

Access detail required by tag 

Access detail required by local TLB 

Access detail required by global TLB 

Cache coherence intervention required by tag 

Cache coherence intervention required by local TLB 

Cache coherence intervention required by global TLB 

Local TLB miss 

Global TLB miss 

Branch Gateway

This operation provides a secure means to call a procedure, including those at a
higher privilege level.

Operation codes

| B.GATE | Branch gateway |

Format

B.GATE ra,rb

31     24 23  18 17  12 11           0
| B.GATE |  ra  |  rb  |              |
    8       6      6          12

Description

A virtual address is computed from the sum of the contents of register ra and
register rb. The contents of 16 bytes of memory using the little-endian byte order
is fetched. A branch and link occurs to the low-order octlet of the memory data,
and the successor to the current program counter, catenated with the current
execution privilege, is placed in register 0. The privilege level is set to the contents
of the low-order two bits of the memory data. Register 1 is loaded with the
high-order octlet of the memory data.

An access disallowed exception occurs if the new privilege level is greater than the
privilege level required to write the memory data, or if the old privilege level is
lower than the privilege required to access the memory data as a gateway.

An access disallowed exception occurs if the target virtual address is a higher
privilege than the current level and gateway access is not set for the gateway
virtual address, or if the access is not aligned on a 16-byte boundary.

A reserved instruction exception occurs if the rb field is non-zero.

Definition

def BranchGateway(ra, rb, rc) as
    a <- RegRead(ra, 64)
    b <- RegRead(rb, 64)
    VirtAddr <- a + b
    if VirtAddr3..0 ≠ 0 then
        raise AccessDisallowedByVirtualAddress
    endif
    if rc ≠ 0 then
        raise ReservedInstruction
    endif
    c <- LoadMemory(VirtAddr, 128, L)
    cx <- c127..64 || ProgramCounter63..2+1 || PrivilegeLevel
    ProgramCounter <- c63..2 || 0^2
    PrivilegeLevel <- c1..0
    RegWrite(rc, 128, cx)
enddef

Exceptions

Reserved Instruction

Access disallowed by virtual address 

Access disallowed by tag 

Access disallowed by global TLB 

Access disallowed by local TLB 

Access detail required by tag 

Access detail required by local TLB 

Access detail required by global TLB 

Cache coherence intervention required by tag 

Cache coherence intervention required by local TLB 

Cache coherence intervention required by global TLB 

Local TLB miss 

Global TLB miss 
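
The data movement performed by a gateway branch can be sketched in Python. This model is deliberately incomplete and rests on assumptions: it covers only the alignment check and the splitting of the 16-byte gateway into a target, a new privilege level and a second octlet, and it omits every TLB- and tag-based privilege check listed above. The function name `branch_gateway`, the `memory` byte buffer and the returned dictionary keys are hypothetical.

```python
MASK64 = (1 << 64) - 1


class AccessDisallowedByVirtualAddress(Exception):
    """Stand-in for the exception named in the definition."""


def branch_gateway(virt_addr, memory, pc, privilege_level):
    """Model the register and program-counter updates of a gateway branch.

    `memory` is a bytes-like object indexed by address; privilege checks are omitted.
    """
    if virt_addr & 0xF:                               # must be 16-byte aligned
        raise AccessDisallowedByVirtualAddress(hex(virt_addr))
    data = int.from_bytes(memory[virt_addr:virt_addr + 16], "little")
    low, high = data & MASK64, data >> 64
    # Successor of the current program counter, catenated with the current privilege.
    link = ((((pc >> 2) + 1) << 2) & MASK64) | (privilege_level & 3)
    return {
        "pc": (low & ~3) & MASK64,                    # low-order octlet, low two bits cleared
        "privilege": low & 3,                         # new privilege level
        "r0": link,
        "r1": high,                                   # high-order octlet
    }
```

Packing the target and its privilege level into one memory word means a caller cannot choose them independently, which is what makes the gateway a controlled entry point.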

Branch Immediate

This operation branches to a location that is specified as an offset from the
program counter, optionally saving the value of the program counter into register
0.

Operation codes

| B.I      | Branch immediate          |
| B.LINK.I | Branch immediate and link |

Format

op target

31     24 23                     0
|   op   |         offset         |
    8                24

Description

If requested, the address of the instruction following this one is placed into
register 0. Execution branches to the address specified by the offset field.

Definition

def BranchImmediate(op, offset) as
    if (op = B.LINK.I) then
        RegWrite(0, 64, ProgramCounter + 4)
    endif
    ProgramCounter <- ProgramCounter + (offset23^38 || offset || 0^2)
enddef

Exceptions 
none 
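
The offset arithmetic in the Branch Immediate definition, a 24-bit field that is sign-extended and scaled by four, can be checked with a short Python sketch. The function name `branch_immediate` and the use of a Python list for the register file are illustrative assumptions; `pc` stands for the ProgramCounter value seen by the definition.

```python
MASK64 = (1 << 64) - 1


def branch_immediate(op, offset24, pc, regs):
    """Return the new program counter; writes the link into regs[0] for B.LINK.I."""
    if op == "B.LINK.I":
        regs[0] = (pc + 4) & MASK64
    offset24 &= 0xFFFFFF
    if offset24 & 0x800000:            # sign bit of the 24-bit field
        offset24 -= 0x1000000
    return (pc + 4 * offset24) & MASK64
```

With 24 offset bits and a scale of four, the reachable range is roughly plus or minus 32 megabytes around the program counter.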

These operations perform calculations with two general register values, placing
the result in a general register.

Operation codes

| E.ADD      | Execute add                                        |
| E.ADD.O    | Execute add and check signed overflow              |
| E.ADD.UO   | Execute add and check unsigned overflow            |
| E.AND      | Execute and                                        |
| E.ANDN     | Execute and not                                    |
| E.ASUM     | Execute and summation of bits                      |
| E.LMS      | Execute signed logarithm of most significant bit   |
| E.NAND     | Execute not and                                    |
| E.NOR      | Execute not or                                     |
| E.OR       | Execute or                                         |
| E.ORN      | Execute or not                                     |
| E.ROTL     | Execute rotate left                                |
| E.ROTR     | Execute rotate right                               |
| E.SELECT.8 | Execute select bytes                               |
| E.SHL      | Execute shift left                                 |
| E.SHL.O    | Execute shift left and check signed overflow       |
| E.SHL.UO   | Execute shift left and check unsigned overflow     |
| E.SHR      | Execute signed shift right                         |
| E.ULMS     | Execute unsigned logarithm of most significant bit |
| E.USHR     | Execute unsigned shift right                       |
| E.XNOR     | Execute exclusive nor                              |
| E.XOR      | Execute xor                                        |



class         operation               check
arithmetic    ADD                     NONE O UO
shift         SHL                     NONE O UO
              SHR USHR
              ROTL ROTR
              SELECT.8
logarithm     LMS ULMS
summation     ASUM
bitwise       OR AND XOR ANDN
              NOR NAND XNOR ORN





Format

op rc=ra,rb

 31      24 23   18 17   12 11    6 5     0
| E.MINOR  |  ra   |  rb   |  rc   |  op   |
     8         6       6       6       6




Description 

The contents of registers ra and rb are fetched and the specified operation is
performed on these operands. The result is placed into register rc.

Definition 

def Execute(op,ra,rb,rc) as
    a <- RegRead(ra, 64)
    b <- RegRead(rb, 64)
    case op of
        E.ROTL:
            c <- a_{63-b_{5..0}..0} || a_{63..64-b_{5..0}}
        E.ROTR:
            c <- a_{b_{5..0}-1..0} || a_{63..b_{5..0}}
        E.SHL:
            c <- a_{63-b_{5..0}..0} || 0^{b_{5..0}}
        E.SHL.O:
            if a_{63..63-b_{5..0}} ≠ a_{63}^{b_{5..0}+1} then
                raise FixedPointArithmetic
            endif
            c <- a_{63-b_{5..0}..0} || 0^{b_{5..0}}
        E.SHL.UO:
            if a_{63..64-b_{5..0}} ≠ 0 then
                raise FixedPointArithmetic
            endif
            c <- a_{63-b_{5..0}..0} || 0^{b_{5..0}}
        E.SHR:
            c <- a_{63}^{b_{5..0}} || a_{63..b_{5..0}}
        E.USHR:
            c <- 0^{b_{5..0}} || a_{63..b_{5..0}}
        E.ADD:
            c <- a + b
        E.ADD.O:
            t <- (a_{63} || a) + (b_{63} || b)
            if t_{64} ≠ t_{63} then
                raise FixedPointArithmetic
            endif
            c <- t_{63..0}
        E.ADD.UO:
            t <- (0^{1} || a) + (0^{1} || b)
            if t_{64} ≠ 0 then
                raise FixedPointArithmetic
            endif
            c <- t_{63..0}
        E.AND:
            c <- a and b
        E.OR:
            c <- a or b
        E.XOR:
            c <- a xor b
        E.ANDN:
            c <- a and not b
        E.NAND:
            c <- not (a and b)
        E.NOR:
            c <- not (a or b)
        E.XNOR:
            c <- not (a xor b)
        E.ORN:
            c <- a or not b
        E.LMS:
            if (a=0) then
                c <- 0
            else
                for i <- 0 to 63
                    if a_{63..i} = (a_{63}^{63-i} || not a_{63}) then
                        c <- i
                    endif
                endfor
            endif
        E.ULMS:
            if (a=0) then
                c <- -1
            else
                for i <- 0 to 63
                    if a_{63..i} = (0^{63-i} || 1) then
                        c <- i
                    endif
                endfor
            endif
        E.ASUM:
            t <- a & b
            u <- (t_{63..1} & 0x5555555555555555) + (t & 0x5555555555555555)
            v <- (u_{63..2} & 0x3333333333333333) + (u & 0x3333333333333333)
            w <- (v_{63..4} & 0x0707070707070707) + (v & 0x0707070707070707)
            x <- (w_{63..8} & 0x000f000f000f000f) + (w & 0x000f000f000f000f)
            c <- x_{52..48} + x_{36..32} + x_{20..16} + x_{4..0}
        E.SELECT.8:
            for i <- 0 to 7
                j <- b_{3*i+2..3*i}
                c_{8*i+7..8*i} <- a_{8*j+7..8*j}
            endfor
    endcase
    RegWrite(rc, 64, c)
enddef
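
The E.ASUM case counts the one bits of (a and b) by summing adjacent bit fields of doubling width. A minimal Python sketch of that reduction follows, assuming 64-bit unsigned operands; the function name e_asum is hypothetical.

```python
def e_asum(a: int, b: int) -> int:
    """Count the one bits in (a & b) using the staged field sums of E.ASUM."""
    t = a & b
    # Stage 1: 2-bit fields each hold a count 0..2.
    u = ((t >> 1) & 0x5555555555555555) + (t & 0x5555555555555555)
    # Stage 2: 4-bit fields each hold a count 0..4.
    v = ((u >> 2) & 0x3333333333333333) + (u & 0x3333333333333333)
    # Stage 3: 8-bit fields each hold a count 0..8.
    w = ((v >> 4) & 0x0707070707070707) + (v & 0x0707070707070707)
    # Stage 4: 16-bit fields each hold a count 0..16.
    x = ((w >> 8) & 0x000F000F000F000F) + (w & 0x000F000F000F000F)
    # Final sum of the four 5-bit partial counts.
    return ((x >> 48) & 0x1F) + ((x >> 32) & 0x1F) + ((x >> 16) & 0x1F) + (x & 0x1F)
```

The result agrees with a direct population count of a & b, which makes the sketch easy to check against bin(a & b).count("1").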

Exceptions 
Fixed-point arithmetic 
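
The overflow checks in E.ADD.O and E.ADD.UO compare bits 64 and 63 of a 65-bit sum. The following Python sketch models both checks under the assumption that register values are held as unsigned 64-bit integers; the function names and the exception class are illustrative stand-ins.

```python
MASK64 = (1 << 64) - 1

class FixedPointArithmetic(Exception):
    """Stand-in for the fixed-point arithmetic exception."""

def _sign_extend_65(x: int) -> int:
    # Replicate bit 63 into bit 64, giving the 65-bit value (x_63 || x).
    return x | ((x >> 63) << 64)

def e_add_o(a: int, b: int) -> int:
    """Add with signed overflow check: trap when t_64 differs from t_63."""
    t = (_sign_extend_65(a) + _sign_extend_65(b)) & ((1 << 65) - 1)
    if (t >> 64) & 1 != (t >> 63) & 1:
        raise FixedPointArithmetic
    return t & MASK64

def e_add_uo(a: int, b: int) -> int:
    """Add with unsigned overflow check: trap when the carry bit t_64 is set."""
    t = a + b  # operands are zero-extended, so bit 64 is the carry out
    if (t >> 64) & 1:
        raise FixedPointArithmetic
    return t & MASK64
```

Note that adding 1 to an all-ones register passes the signed check (it is -1 + 1) but fails the unsigned one.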
Execute Copy Immediate

This operation produces one immediate value, placing the result in a general 
register. 

Operation codes

E.COPY.I    Execute copy immediate



Format

E.COPY.I ra=imm

 31      24 23   18 17                    0
| E.COPY.I |  ra   |          imm          |
     8         6              18

Description 

A 64-bit immediate value is sign-extended from the 18-bit imm field. The result is
placed into register ra.

Definition 

def ExecuteCopyImmediate(op,ra,imm) as
    i <- (imm_{17}^{46} || imm)
    RegWrite(ra, 64, i)
enddef

Exceptions 
none 
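
As a rough illustration of the sign extension performed by E.COPY.I, the sketch below builds the 64-bit value by replicating bit 17 of the immediate 46 times. It assumes the result is represented as an unsigned 64-bit integer; the function name is hypothetical.

```python
def e_copy_i(imm18: int) -> int:
    """Sign-extend an 18-bit immediate to a 64-bit register value."""
    imm18 &= 0x3FFFF                      # keep the 18-bit field
    sign = (imm18 >> 17) & 1              # bit 17 is the sign bit
    upper = sign * ((1 << 46) - 1)        # 46 copies of the sign bit
    return (upper << 18) | imm18
```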
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Execute Field Immediate 

These operations perform calculations with one or two general register values and 
two immediate values, placing the result in a general register. 



Operation codes

E.DEP.I     Execute deposit immediate
E.MDEP.I    Execute merge deposit immediate
E.UDEP.I    Execute unsigned deposit immediate
E.UWTH.I    Execute unsigned withdraw immediate
E.WTH.I     Execute withdraw immediate



Format

op rb=ra,ishift,isize

 31      24 23   18 17   12 11    6 5     0
|    op    |  ra   |  rb   | ishift| isizea|
     8         6       6       6       6

Description 

The contents of register ra and, if specified, the contents of register rb are
fetched, and 6-bit immediate values are taken from the 6-bit ishift and isizea
fields. The specified operation is performed on these operands. The result is
placed into register rb.

Definition 

def ExecuteFieldImmediate(op,ra,rb,ishift,isizea) as
    a <- RegRead(ra, 64)
    isize <- isizea+1
    if (ishift+isize>64) then
        raise ReservedInstruction
    endif
    case op of
        E.DEP.I:
            b <- a_{isize-1}^{64-isize-ishift} || a_{isize-1..0} || 0^{ishift}
        E.UDEP.I:
            b <- 0^{64-isize-ishift} || a_{isize-1..0} || 0^{ishift}
        E.MDEP.I:
            m <- RegRead(rb, 64)
            b <- m_{63..isize+ishift} || a_{isize-1..0} || m_{ishift-1..0}
        E.WTH.I:
            b <- a_{isize+ishift-1}^{64-isize} || a_{isize+ishift-1..ishift}
        E.UWTH.I:
            b <- 0^{64-isize} || a_{isize+ishift-1..ishift}
    endcase
    RegWrite(rb, 64, b)
enddef

Exceptions 
Reserved instruction
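
The withdraw and merge-deposit operations can be modelled with shifts and masks. The Python sketch below is an illustration only: it assumes 64-bit unsigned register values and the encoding isize = isizea + 1 described above, and its function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

class ReservedInstruction(Exception):
    """Stand-in for the reserved instruction exception."""

def _field_size(ishift: int, isizea: int) -> int:
    isize = isizea + 1
    if ishift + isize > 64:
        raise ReservedInstruction
    return isize

def e_uwth_i(a: int, ishift: int, isizea: int) -> int:
    """Unsigned withdraw: extract the field and zero-extend it."""
    isize = _field_size(ishift, isizea)
    return (a >> ishift) & ((1 << isize) - 1)

def e_wth_i(a: int, ishift: int, isizea: int) -> int:
    """Signed withdraw: extract the field and sign-extend it to 64 bits."""
    isize = _field_size(ishift, isizea)
    field = (a >> ishift) & ((1 << isize) - 1)
    if field >> (isize - 1):
        field |= MASK64 & ~((1 << isize) - 1)
    return field

def e_mdep_i(a: int, m: int, ishift: int, isizea: int) -> int:
    """Merge deposit: place the low isize bits of a into m at bit ishift."""
    isize = _field_size(ishift, isizea)
    field_mask = ((1 << isize) - 1) << ishift
    return (m & ~field_mask & MASK64) | ((a << ishift) & field_mask)
```

For instance, withdrawing an 8-bit field at bit 4 from 0xABCD yields 0xBC unsigned, or that value sign-extended in the signed form.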
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Execute Immediate 

These operations perform calculations with one general register value and one
immediate value, placing the result in a general register.



Operation codes

E.ADD.I       Execute add immediate
E.ADD.I.O     Execute add immediate and check signed overflow
E.ADD.I.UO    Execute add immediate and check unsigned overflow
E.AND.I       Execute and immediate
E.NAND.I      Execute not and immediate
E.NOR.I       Execute not or immediate
E.OR.I        Execute or immediate
E.XOR.I       Execute xor immediate



class         operation       check
arithmetic    ADD             NONE O UO
bitwise       AND OR XOR
              NAND NOR




Format

op rb=ra,imm

 31      24 23   18 17   12 11            0
|    op    |  ra   |  rb   |      imm      |
     8         6       6          12





Description

The contents of register ra are fetched, and a 64-bit immediate value is
sign-extended from the 12-bit imm field. The specified operation is performed on
these operands. The result is placed into register rb.

Definition 

def ExecuteImmediate(op,ra,rb,imm) as
    i <- (imm_{11}^{52} || imm)
    a <- RegRead(ra, 64)
    case op of
        E.AND.I:
            b <- a and i
        E.OR.I:
            b <- a or i
        E.NAND.I:
            b <- a nand i
        E.NOR.I:
            b <- a nor i
        E.XOR.I:
            b <- a xor i
        E.ADD.I:
            b <- a + i
        E.ADD.I.O:
            t <- (a_{63} || a) + (i_{63} || i)
            if t_{64} ≠ t_{63} then
                raise FixedPointArithmetic
            endif
            b <- t_{63..0}
    endcase
    RegWrite(rb, 64, b)
enddef

Exceptions 
Fixed-point arithmetic 
Execute Immediate Reversed

These operations perform calculations with one general register value and one 
immediate value, placing the result in a general register. 

Operation codes

E.SET.I.E      Execute set immediate equal
E.SET.I.GE     Execute set immediate signed greater or equal
E.SET.I.L      Execute set immediate signed less
E.SET.I.NE     Execute set immediate not equal
E.SET.I.UGE    Execute set immediate unsigned greater or equal
E.SET.I.UL     Execute set immediate unsigned less
E.SUB.I        Execute subtract immediate
E.SUB.I.E      Execute subtract immediate and check equal
E.SUB.I.GE     Execute subtract immediate and check signed greater or equal
E.SUB.I.L      Execute subtract immediate and check signed less
E.SUB.I.NE     Execute subtract immediate and check not equal
E.SUB.I.O      Execute subtract immediate and check signed overflow
E.SUB.I.UGE    Execute subtract immediate and check unsigned greater or equal
E.SUB.I.UL     Execute subtract immediate and check unsigned less
E.SUB.I.UO     Execute subtract immediate and check unsigned overflow



class         operation                   check
arithmetic    SUB                         NONE O UO
                                          E L UL
                                          NE GE UGE
boolean       SET.E SET.L SET.UL
              SET.NE SET.GE SET.UGE





Format

op rb=imm,ra

 31      24 23   18 17   12 11            0
|    op    |  ra   |  rb   |      imm      |
     8         6       6          12



Description

The contents of register ra are fetched, and a 64-bit immediate value is
sign-extended from the 12-bit imm field. The specified operation is performed on
these operands. The result is placed into register rb.

Definition 

def ExecuteImmediate(op,ra,rb,imm) as
    i <- (imm_{11}^{52} || imm)
    a <- RegRead(ra, 64)
    case op of
        E.SUB.I:
            b <- i - a
        E.SUB.I.O:
            t <- (i_{63} || i) - (a_{63} || a)
            if t_{64} ≠ t_{63} then
                raise FixedPointArithmetic
            endif
            b <- t_{63..0}
        E.SET.I.E:
            b <- (i = a)^{64}
        E.SET.I.NE:
            b <- (i ≠ a)^{64}
        E.SET.I.L:
            b <- (i < a)^{64}
        E.SET.I.GE:
            b <- (i ≥ a)^{64}
        E.SET.I.UL:
            b <- ((0 || i) < (0 || a))^{64}
        E.SET.I.UGE:
            b <- ((0 || i) ≥ (0 || a))^{64}
        E.SUB.I.E:
            b <- i - a
            if i ≠ a then
                raise FixedPointArithmetic
            endif
        E.SUB.I.NE:
            b <- i - a
            if i = a then
                raise FixedPointArithmetic
            endif
        E.SUB.I.L:
            b <- i - a
            if i ≥ a then
                raise FixedPointArithmetic
            endif
        E.SUB.I.GE:
            b <- i - a
            if i < a then
                raise FixedPointArithmetic
            endif
        E.SUB.I.UL:
            b <- i - a
            if (0 || i) ≥ (0 || a) then
                raise FixedPointArithmetic
            endif
        E.SUB.I.UGE:
            b <- i - a
            if (0 || i) < (0 || a) then
                raise FixedPointArithmetic
            endif
    endcase
    RegWrite(rb, 64, b)
enddef

Exceptions 
Fixed-point arithmetic 
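
The reversed operand order (immediate first, register second) and the difference between the set and check forms can be illustrated with a short Python model. It assumes 64-bit unsigned register values, that a true set result is 64 one bits, and that a check form traps when its condition does not hold; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

class FixedPointArithmetic(Exception):
    """Stand-in for the fixed-point arithmetic exception."""

def imm12(imm: int) -> int:
    """Sign-extend a 12-bit immediate to an unsigned 64-bit value."""
    imm &= 0xFFF
    return imm | (MASK64 & ~0xFFF) if imm & 0x800 else imm

def signed(x: int) -> int:
    return x - (1 << 64) if x >> 63 else x

def e_set_i_l(a: int, imm: int) -> int:
    """E.SET.I.L: all ones when imm < ra as signed values, else zero."""
    return MASK64 if signed(imm12(imm)) < signed(a) else 0

def e_set_i_ul(a: int, imm: int) -> int:
    """E.SET.I.UL: the same comparison on unsigned values."""
    return MASK64 if imm12(imm) < a else 0

def e_sub_i_ge(a: int, imm: int) -> int:
    """E.SUB.I.GE: compute imm - ra, trapping unless imm >= ra (signed)."""
    i = imm12(imm)
    if signed(i) < signed(a):
        raise FixedPointArithmetic
    return (i - a) & MASK64
```

An immediate of 0xFFF is -1 when compared signed but the largest 64-bit value when compared unsigned, so the two set forms disagree for it.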
Execute Inplace

These operations perform calculations with three general register values, placing 
the result in the third general register. 

Operation codes

E.MSHR    Execute merge shift right



Format

E.MSHR rc=ra,rb,rc

 31      24 23   18 17   12 11    6 5     0
| E.MINOR  |  ra   |  rb   |  rc   |  op   |
     8         6       6       6       6

Description

The contents of registers ra, rb, and rc are fetched. The specified operation is 
performed on these operands. The result is placed into register rc. 

Definition 

def ExecuteTernaryInplace(op,ra,rb,rc) as
    a <- RegRead(ra, 64)
    b <- RegRead(rb, 64)
    c <- RegRead(rc, 64)
    case op of
        E.MSHR:

    endcase
    RegWrite(rc, 64, d)
enddef

Exceptions 
none 
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Execute Reversed 

These operations perform calculations with two general register values, placing
the result in a general register.



Operation codes

E.SET.E      Execute set equal
E.SET.GE     Execute set signed greater or equal
E.SET.L      Execute set signed less
E.SET.NE     Execute set not equal
E.SET.UGE    Execute set unsigned greater or equal
E.SET.UL     Execute set unsigned less
E.SUB        Execute subtract
E.SUB.E      Execute subtract and check equal
E.SUB.GE     Execute subtract and check signed greater or equal
E.SUB.L      Execute subtract and check signed less
E.SUB.NE     Execute subtract and check not equal
E.SUB.O      Execute subtract and check signed overflow
E.SUB.UGE    Execute subtract and check unsigned greater or equal
E.SUB.UL     Execute subtract and check unsigned less
E.SUB.UO     Execute subtract and check unsigned overflow



class         operation                   check
arithmetic    SUB                         NONE O UO
                                          E L UL
                                          NE GE UGE
boolean       SET.E SET.L SET.UL
              SET.NE SET.GE SET.UGE





Format

op rc=rb,ra

 31      24 23   18 17   12 11    6 5     0
| E.MINOR  |  ra   |  rb   |  rc   |  op   |
     8         6       6       6       6

Description

The contents of registers ra and rb are fetched and the specified operation is
performed on these operands. The result is placed into register rc.

Definition 

def ExecuteReversed(op,ra,rb,rc) as
    a <- RegRead(ra, 64)
    b <- RegRead(rb, 64)
    case op of
        E.SUB:
            c <- b - a
        E.SUB.O:
            t <- (b_{63} || b) - (a_{63} || a)
            if t_{64} ≠ t_{63} then
                raise FixedPointArithmetic
            endif
            c <- t_{63..0}
        E.SUB.UO:
            t <- (0^{1} || b) - (0^{1} || a)
            if t_{64} ≠ 0 then
                raise FixedPointArithmetic
            endif
            c <- t_{63..0}
        E.SUB.E:
            c <- b - a
            if b ≠ a then
                raise FixedPointArithmetic
            endif
        E.SUB.NE:
            c <- b - a
            if b = a then
                raise FixedPointArithmetic
            endif
        E.SUB.L:
            c <- b - a
            if b ≥ a then
                raise FixedPointArithmetic
            endif
        E.SUB.GE:
            c <- b - a
            if b < a then
                raise FixedPointArithmetic
            endif
        E.SUB.UL:
            c <- b - a
            if (0 || b) ≥ (0 || a) then
                raise FixedPointArithmetic
            endif
        E.SUB.UGE:
            c <- b - a
            if (0 || b) < (0 || a) then
                raise FixedPointArithmetic
            endif
        E.SET.E:
            c <- (b = a)^{64}
        E.SET.NE:
            c <- (b ≠ a)^{64}
        E.SET.L:
            c <- (b < a)^{64}
        E.SET.GE:
            c <- (b ≥ a)^{64}
        E.SET.UL:
            c <- ((0 || b) < (0 || a))^{64}
        E.SET.UGE:
            c <- ((0 || b) ≥ (0 || a))^{64}
    endcase
    RegWrite(rc, 64, c)
enddef

Exceptions 
Fixed-point arithmetic 
Execute Short Immediate

These operations perform calculations with one general register value and one 
immediate value, placing the result in a general register. 



Operation codes

E.ROTR.I       Execute rotate right immediate
E.SHL.I        Execute shift left immediate
E.SHL.I.O      Execute shift left immediate and check signed overflow
E.SHL.I.UO     Execute shift left immediate and check unsigned overflow
E.SHR.I        Execute signed shift right immediate
E.SHUFFLE.I    Execute shuffle immediate
E.USHR.I       Execute unsigned shift right immediate



Format

op rb=ra,simm

 31      24 23   18 17   12 11    6 5     0
| E.MINOR  |  ra   |  rb   | simm  |  op   |
     8         6       6       6       6



Description 

The contents of register ra are fetched, and a 6-bit immediate value is taken from
the 6-bit simm field. The specified operation is performed on these operands. The
result is placed into register rb.

Definition 

def ExecuteShortImmediate(op,ra,rb,simm) as
    a <- RegRead(ra, 64)
    case op of
        E.SHUFFLE.I:
            case simm of
                0:
                    b <- a
                1..35:
                    for x <- 0 to 7; for y <- 0 to x-1; for z <- 1 to x-y
                        if simm = ((x*x*x-3*x*x-4*x)/6-(z*z-z)/2+x*z+y+1) then
                            for i <- 0 to 63

                            endfor
                        endif
                    endfor; endfor; endfor
                36..255:
                    raise ReservedInstruction
            endcase
        E.ROTR.I:
            b <- a_{simm-1..0} || a_{63..simm}
        E.SHL.I:
            b <- a_{63-simm..0} || 0^{simm}
        E.SHL.I.O:
            if a_{63..63-simm} ≠ a_{63}^{simm+1} then
                raise FixedPointArithmetic
            endif
            b <- a_{63-simm..0} || 0^{simm}
        E.SHL.I.UO:
            if a_{63..64-simm} ≠ 0 then
                raise FixedPointArithmetic
            endif
            b <- a_{63-simm..0} || 0^{simm}
        E.SHR.I:
            b <- a_{63}^{simm} || a_{63..simm}
        E.USHR.I:
            b <- 0^{simm} || a_{63..simm}
    endcase
    RegWrite(rb, 64, b)
enddef

Exceptions
Fixed-point arithmetic 
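
Two of the short-immediate cases are sketched below in Python: the rotate, which concatenates the low simm bits above the remaining high bits, and the shift left with signed overflow check, which traps unless the simm+1 most significant bits are all equal. The sketch assumes 64-bit unsigned register values; its names are hypothetical.

```python
MASK64 = (1 << 64) - 1

class FixedPointArithmetic(Exception):
    """Stand-in for the fixed-point arithmetic exception."""

def e_rotr_i(a: int, simm: int) -> int:
    """Rotate right: low simm bits of a move to the top of the result."""
    simm &= 63
    return ((a >> simm) | (a << (64 - simm))) & MASK64

def e_shl_i_o(a: int, simm: int) -> int:
    """Shift left, trapping if any shifted-out bit differs from the result sign."""
    simm &= 63
    width = simm + 1
    top = a >> (63 - simm)                # bits 63..63-simm of a
    if top not in (0, (1 << width) - 1):  # must be all zeros or all ones
        raise FixedPointArithmetic
    return (a << simm) & MASK64
```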
Execute Short Immediate Inplace

These operations perform calculations with one general register value and one
immediate value, placing the result in a general register.

Operation codes

E.MSHR.I    Execute merge shift right immediate

Format

op rb=ra,simm

 31      24 23   18 17   12 11    6 5     0
| E.MINOR  |  ra   |  rb   | simm  |  op   |
     8         6       6       6       6

Description 

The contents of registers ra and rb are fetched, and a 6-bit immediate value is
taken from the 6-bit simm field. The specified operation is performed on these
operands. The result is placed into register rb.

Definition 

def ExecuteShortImmediateInplace(op,ra,rb,simm) as
a <- RegRead(ra. 64) 
b <- RegRead(rb. 64) 
case op of 

E.MSHR.I: 

c <- b63..63-simnn H 353. 

endcase 

RegWrite(rb. 64. c) 
enddef 

Exceptions 
Fixed-point arithmetic 
Execute Swizzle Immediate

These operations perform calculations with a general register value and two 
immediate values, placing the result in a general register. 

Operation codes

E.SWIZZLE.I    Execute swizzle immediate

Format

op rb=ra,icopy,iswap

 31      24 23   18 17   12 11    6 5     0
|    op    |  ra   |  rb   | icopy | iswap |
     8         6       6       6       6

Description 

The contents of register ra are fetched, and 6-bit immediate values are taken from
the 6-bit icopy and iswap fields. The specified operation is performed on these
operands. The result is placed into register rb.

Definition

def GroupSwizzleImmediate(op,ra,rb,icopy,iswap) as
    a <- RegRead(ra, 64)
    for i <- 0 to 63
        b_{i} <- a_{(i & icopy) ^ iswap}
    endfor
    RegWrite(rb, 64, b)
enddef

Exceptions 
Reserved instruction
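
The swizzle definition selects, for each result bit i, the source bit numbered (i and icopy) xor iswap. A minimal Python sketch follows, assuming a 64-bit unsigned register value; the function name is hypothetical.

```python
def e_swizzle_i(a: int, icopy: int, iswap: int) -> int:
    """Build each result bit i from source bit (i & icopy) ^ iswap."""
    b = 0
    for i in range(64):
        src = (i & icopy) ^ iswap
        b |= ((a >> src) & 1) << i
    return b
```

Under this model, icopy=63 with iswap=0 is the identity, iswap=63 reverses the bit order, iswap=56 reverses the byte order, and icopy=0 broadcasts a single source bit to every position.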






Execute Ternary 

These operations perform calculations with three general register values, placing 
the result in a fourth general register. 



Operation codes

E.8MUX              Execute 8-way multiplex
E.MUX               Execute multiplex
E.TRANSPOSE.8MUX    Execute transpose and 8-way multiplex



Format

op rd=ra,rb,rc

 31      24 23   18 17   12 11    6 5     0
|    op    |  ra   |  rb   |  rc   |  rd   |
     8         6       6       6       6

Description

The contents of registers ra, rb, and rc are fetched. The specified operation is 
performed on these operands. The result is placed into register rd. 

Definition 

def ExecuteTernary(op,ra,rb,rc,rd) as
    case op of
        E.8MUX, E.TRANSPOSE.8MUX:
            a <- RegRead(ra, 64)
            b <- RegRead(rb, 128)
            c <- RegRead(rc, 64)
        E.MUX:
            a <- RegRead(ra, 64)
            b <- RegRead(rb, 64)
            c <- RegRead(rc, 64)
    endcase
    case op of
        E.8MUX:
            for i <- 0 to 63

            endfor
        E.TRANSPOSE.8MUX:
            for i <- 0 to 63

            endfor
            for i <- 0 to 63

            endfor
        E.MUX:
            d <- (b and a) or (c and not a)
    endcase
    RegWrite(rd, 64, d)
enddef

Exceptions 
none 
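
The E.MUX case is a bitwise select: each bit of ra chooses between the corresponding bits of rb and rc. A one-function Python sketch, assuming 64-bit unsigned values and a hypothetical function name:

```python
MASK64 = (1 << 64) - 1

def e_mux(a: int, b: int, c: int) -> int:
    """Where a has a one bit take the bit from b, otherwise take it from c."""
    return ((b & a) | (c & ~a)) & MASK64
```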
Floating-point



These operations perform floating-point arithmetic on two floating-point operands. 

Operation codes

F.ADD.16       Floating-point add half
F.ADD.16.C     Floating-point add half ceiling
F.ADD.16.F     Floating-point add half floor
F.ADD.16.N     Floating-point add half nearest
F.ADD.16.T     Floating-point add half truncate
F.ADD.16.X     Floating-point add half exact
F.ADD.32       Floating-point add single
F.ADD.32.C     Floating-point add single ceiling
F.ADD.32.F     Floating-point add single floor
F.ADD.32.N     Floating-point add single nearest
F.ADD.32.T     Floating-point add single truncate
F.ADD.32.X     Floating-point add single exact
F.ADD.64       Floating-point add double
F.ADD.64.C     Floating-point add double ceiling
F.ADD.64.F     Floating-point add double floor
F.ADD.64.N     Floating-point add double nearest
F.ADD.64.T     Floating-point add double truncate
F.ADD.64.X     Floating-point add double exact
F.ADD.128      Floating-point add quad
F.ADD.128.C    Floating-point add quad ceiling
F.ADD.128.F    Floating-point add quad floor
F.ADD.128.N    Floating-point add quad nearest
F.ADD.128.T    Floating-point add quad truncate
F.ADD.128.X    Floating-point add quad exact
F.DIV.16       Floating-point divide half
F.DIV.16.C     Floating-point divide half ceiling
F.DIV.16.F     Floating-point divide half floor
F.DIV.16.N     Floating-point divide half nearest
F.DIV.16.T     Floating-point divide half truncate
F.DIV.16.X     Floating-point divide half exact
F.DIV.32       Floating-point divide single
F.DIV.32.C     Floating-point divide single ceiling
F.DIV.32.F     Floating-point divide single floor
F.DIV.32.N     Floating-point divide single nearest
F.DIV.32.T     Floating-point divide single truncate
F.DIV.32.X     Floating-point divide single exact
F.DIV.64       Floating-point divide double
F.DIV.64.C     Floating-point divide double ceiling
F.DIV.64.F     Floating-point divide double floor
F.DIV.64.N     Floating-point divide double nearest
F.DIV.64.T     Floating-point divide double truncate
F.DIV.64.X     Floating-point divide double exact
F.DIV.128      Floating-point divide quad
F.DIV.128.C    Floating-point divide quad ceiling
F.DIV.128.F    Floating-point divide quad floor
F.DIV.128.N    Floating-point divide quad nearest
F.DIV.128.T    Floating-point divide quad truncate
F.DIV.128.X    Floating-point divide quad exact
F.MUL.16       Floating-point multiply half
F.MUL.16.C     Floating-point multiply half ceiling
F.MUL.16.F     Floating-point multiply half floor
F.MUL.16.N     Floating-point multiply half nearest
F.MUL.16.T     Floating-point multiply half truncate
F.MUL.16.X     Floating-point multiply half exact
F.MUL.32       Floating-point multiply single
F.MUL.32.C     Floating-point multiply single ceiling
F.MUL.32.F     Floating-point multiply single floor
F.MUL.32.N     Floating-point multiply single nearest
F.MUL.32.T     Floating-point multiply single truncate
F.MUL.32.X     Floating-point multiply single exact
F.MUL.64       Floating-point multiply double
F.MUL.64.C     Floating-point multiply double ceiling
F.MUL.64.F     Floating-point multiply double floor
F.MUL.64.N     Floating-point multiply double nearest
F.MUL.64.T     Floating-point multiply double truncate
F.MUL.64.X     Floating-point multiply double exact
F.MUL.128      Floating-point multiply quad
F.MUL.128.C    Floating-point multiply quad ceiling
F.MUL.128.F    Floating-point multiply quad floor
F.MUL.128.N    Floating-point multiply quad nearest
F.MUL.128.T    Floating-point multiply quad truncate
F.MUL.128.X    Floating-point multiply quad exact





            op     prec            round/trap
add         ADD    16 32 64 128    NONE C F N T X
multiply    MUL    16 32 64 128    NONE C F N T X
divide      DIV    16 32 64 128    NONE C F N T X



Format

F.op.prec.round rc=ra,rb

 31      24 23   18 17   12 11    6 5        0
|  F.prec  |  ra   |  rb   |  rc   | op.round |
     8         6       6       6        6
Description

The contents of registers or register pairs specified by ra and rb are combined
using the specified floating-point operation. The operation is rounded using the
specified rounding option or using round-to-nearest if not specified. The result is
placed in the register or register pair specified by rc.

If a rounding option is specified, the operation raises a floating-point exception if a
floating-point invalid operation, divide by zero, overflow, or underflow occurs, or
when specified, if the result is inexact. If a rounding option is not specified,
floating-point exceptions are not raised, and are handled according to the default
rules of IEEE 754.

If F128 precision is specified, ra, rb and rc refer to an aligned pair of registers, and
a reserved instruction exception occurs if the low-order bit of these operands is
set.
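
The register-pair rule and the operand width selection can be sketched as follows. This is an illustrative Python model only: it assumes register operands are given as register numbers, and the names check_operands and read_width are hypothetical.

```python
class ReservedInstruction(Exception):
    """Stand-in for the reserved instruction exception."""

def check_operands(prec: int, *regs: int) -> None:
    """For 128-bit precision every operand must name an even (aligned) register."""
    if prec == 128 and any(r & 1 for r in regs):
        raise ReservedInstruction

def read_width(prec: int) -> int:
    """Width passed to RegRead/RegWrite: (prec < 64) ? 64 : prec."""
    return 64 if prec < 64 else prec
```

Half and single operands are therefore still read as full 64-bit registers, while quad operands are read as 128-bit register pairs.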

Definition

def FloatingPoint(op,prec,round,ra,rb,rc) as
    a <- F(prec, RegRead(ra, (prec<64) ? 64 : prec))
    b <- F(prec, RegRead(rb, (prec<64) ? 64 : prec))
    if round≠NONE then
        if isSignallingNaN(a) | isSignallingNaN(b) then
            raise FloatingPointException
        endif
        case op of
            F.DIV:
                if b=0 then
                    raise FloatingPointArithmetic
                endif
            others:
        endcase
    endif
    case op of
        F.ADD:
            c <- a+b
        F.MUL:
            c <- a*b
        F.DIV:
            c <- a/b
    endcase
    case round of
        X:
        N:
        T:
        F:
        C:
        NONE:
    endcase
    RegWrite(rc, (prec<64) ? 64 : prec, PackF(prec,c))
enddef
Floating-point Reversed

These operations perform floating-point arithmetic on two floating-point operands.



Operation codes

F.SET.E.16        Floating-point set equal half
F.SET.E.16.X      Floating-point set equal half exact
F.SET.E.32        Floating-point set equal single
F.SET.E.32.X      Floating-point set equal single exact
F.SET.E.64        Floating-point set equal double
F.SET.E.64.X      Floating-point set equal double exact
F.SET.E.128       Floating-point set equal quad
F.SET.E.128.X     Floating-point set equal quad exact
F.SET.GE.16.X     Floating-point set greater or equal half exact
F.SET.GE.32.X     Floating-point set greater or equal single exact
F.SET.GE.64.X     Floating-point set greater or equal double exact
F.SET.GE.128.X    Floating-point set greater or equal quad exact
F.SET.L.16.X      Floating-point set less half exact
F.SET.L.32.X      Floating-point set less single exact
F.SET.L.64.X      Floating-point set less double exact
F.SET.L.128.X     Floating-point set less quad exact
F.SET.NE.16       Floating-point set not equal half
F.SET.NE.16.X     Floating-point set not equal half exact
F.SET.NE.32       Floating-point set not equal single
F.SET.NE.32.X     Floating-point set not equal single exact
F.SET.NE.64       Floating-point set not equal double
F.SET.NE.64.X     Floating-point set not equal double exact
F.SET.NE.128      Floating-point set not equal quad
F.SET.NE.128.X    Floating-point set not equal quad exact
F.SET.NGE.16.X    Floating-point set not greater or equal half exact
F.SET.NGE.32.X    Floating-point set not greater or equal single exact
F.SET.NGE.64.X    Floating-point set not greater or equal double exact
F.SET.NGE.128.X   Floating-point set not greater or equal quad exact
F.SET.NL.16.X     Floating-point set not less half exact
F.SET.NL.32.X     Floating-point set not less single exact
F.SET.NL.64.X     Floating-point set not less double exact
F.SET.NL.128.X    Floating-point set not less quad exact
F.SET.NUE.16      Floating-point set not unordered or equal half
F.SET.NUE.16.X    Floating-point set not unordered or equal half exact
F.SET.NUE.32      Floating-point set not unordered or equal single
F.SET.NUE.32.X    Floating-point set not unordered or equal single exact
F.SET.NUE.64      Floating-point set not unordered or equal double
F.SET.NUE.64.X    Floating-point set not unordered or equal double exact
F.SET.NUE.128     Floating-point set not unordered or equal quad
F.SET.NUE.128.X   Floating-point set not unordered or equal quad exact
F.SET.NUGE.16     Floating-point set not unordered greater or equal half
F.SET.NUGE.32     Floating-point set not unordered greater or equal single
F.SET.NUGE.64     Floating-point set not unordered greater or equal double
F.SET.NUGE.128    Floating-point set not unordered greater or equal quad
F.SET.NUL.16      Floating-point set not unordered or less half
F.SET.NUL.32      Floating-point set not unordered or less single
F.SET.NUL.64      Floating-point set not unordered or less double
F.SET.NUL.128     Floating-point set not unordered or less quad
F.SET.UE.16       Floating-point set unordered or equal half
F.SET.UE.16.X     Floating-point set unordered or equal half exact
F.SET.UE.32       Floating-point set unordered or equal single
F.SET.UE.32.X     Floating-point set unordered or equal single exact
F.SET.UE.64       Floating-point set unordered or equal double
F.SET.UE.64.X     Floating-point set unordered or equal double exact
F.SET.UE.128      Floating-point set unordered or equal quad
F.SET.UE.128.X    Floating-point set unordered or equal quad exact
F.SET.UGE.16      Floating-point set unordered greater or equal half
F.SET.UGE.32      Floating-point set unordered greater or equal single
F.SET.UGE.64      Floating-point set unordered greater or equal double
F.SET.UGE.128     Floating-point set unordered greater or equal quad
F.SET.UL.16       Floating-point set unordered or less half
F.SET.UL.32       Floating-point set unordered or less single
F.SET.UL.64       Floating-point set unordered or less double
F.SET.UL.128      Floating-point set unordered or less quad
F.SUB.16          Floating-point subtract half
F.SUB.16.C        Floating-point subtract half ceiling
F.SUB.16.F        Floating-point subtract half floor
F.SUB.16.N        Floating-point subtract half nearest
F.SUB.16.T        Floating-point subtract half truncate
F.SUB.16.X        Floating-point subtract half exact
F.SUB.32          Floating-point subtract single
F.SUB.32.C        Floating-point subtract single ceiling
F.SUB.32.F        Floating-point subtract single floor
F.SUB.32.N        Floating-point subtract single nearest
F.SUB.32.T        Floating-point subtract single truncate
F.SUB.32.X        Floating-point subtract single exact
F.SUB.64          Floating-point subtract double
F.SUB.64.C        Floating-point subtract double ceiling
F.SUB.64.F        Floating-point subtract double floor
F.SUB.64.N        Floating-point subtract double nearest
F.SUB.64.T        Floating-point subtract double truncate
F.SUB.64.X        Floating-point subtract double exact
F.SUB.128         Floating-point subtract quad
F.SUB.128.C       Floating-point subtract quad ceiling
F.SUB.128.F       Floating-point subtract quad floor
F.SUB.128.N       Floating-point subtract quad nearest
F.SUB.128.T       Floating-point subtract quad truncate
F.SUB.128.X       Floating-point subtract quad exact





            op                        prec            round/trap
set         SET. E NE UE NUE          16 32 64 128    NONE X
            SET. NUGE NUL UGE UL      16 32 64 128    NONE
            SET. L GE NL NGE          16 32 64 128    X
subtract    SUB                       16 32 64 128    NONE C F N T X



Format

F.op.prec.round rc=rb,ra

 31      24 23   18 17   12 11    6 5        0
|  F.prec  |  ra   |  rb   |  rc   | op.round |
     8         6       6       6        6

Description

The contents of registers or register pairs specified by ra and rb are combined
using the specified floating-point operation. The operation is rounded using the
specified rounding option or using round-to-nearest if not specified. The result is
placed in the register or register pair specified by rc.

If a rounding option is specified, the operation raises a floating-point exception if a
floating-point invalid operation, divide by zero, overflow, or underflow occurs, or
when specified, if the result is inexact. If a rounding option is not specified,
floating-point exceptions are not raised, and are handled according to the default
rules of IEEE 754.

If F128 precision is specified, ra, rb and rc refer to an aligned pair of registers, and
a reserved instruction exception occurs if the low-order bit of these operands is
set.

Definition 

def FloatingPointReversed(op,prec,round,ra,rb,rc) as
    a <- F(prec, RegRead(ra, (prec<64) ? 64 : prec))
    b <- F(prec, RegRead(rb, (prec<64) ? 64 : prec))
    if round≠NONE then
        if isSignallingNaN(a) | isSignallingNaN(b) then
            raise FloatingPointException
        endif
        case op of
            F.SET.L, F.SET.GE, F.SET.NL, F.SET.NGE:
                if isNaN(a) | isNaN(b) then
                    raise FloatingPointArithmetic
                endif
            others:
        endcase
    endif
    case op of
        F.SUB:
            c <- b-a
        F.SET.NUGE, F.SET.L:
            c <- b !?≥ a
        F.SET.NUL, F.SET.GE:
            c <- b !?< a
        F.SET.UGE, F.SET.NL:
            c <- b ?≥ a
        F.SET.UL, F.SET.NGE:
            c <- b ?< a
        F.SET.UE:
            c <- b ?= a
        F.SET.NUE:
            c <- b !?= a
        F.SET.E:
            c <- b = a
        F.SET.NE:
            c <- b ≠ a
    endcase
    case op of
        F.SUB:
            destprec <- prec
        F.SET.NUGE, F.SET.NUL, F.SET.UGE, F.SET.UL,
        F.SET.L, F.SET.GE, F.SET.E, F.SET.NE, F.SET.UE, F.SET.NUE:
            destprec <- INT
    endcase
    case round of
        X:
        N:
        T:
        F:
        C:
        NONE:
    endcase
    case destprec of
        16, 32, 64, 128:
            RegWrite(rc, (destprec<64) ? 64 : destprec, PackF(destprec,c))
        INT:
            RegWrite(rc, 64, c)
    endcase
enddef

Exceptions 

Reserved instruction
Floating-point arithmetic 
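
The comparison predicates differ only in how they treat unordered operands, that is, pairs in which at least one operand is a NaN. The Python sketch below models eight of them on ordinary floats. It reads the "?" prefix in the definition as "unordered or" and "!" as negation, which is an interpretation of the notation rather than something the text states, and it returns a plain boolean instead of a register value. The names unordered, PREDICATES and f_set are hypothetical.

```python
import math

def unordered(b: float, a: float) -> bool:
    """Two operands are unordered when either of them is a NaN."""
    return math.isnan(a) or math.isnan(b)

# Operand order follows the reversed format: the comparison is "b op a".
PREDICATES = {
    "E":    lambda b, a: b == a,
    "NE":   lambda b, a: b != a,
    "UE":   lambda b, a: unordered(b, a) or b == a,
    "NUE":  lambda b, a: not (unordered(b, a) or b == a),
    "UGE":  lambda b, a: unordered(b, a) or b >= a,
    "NUGE": lambda b, a: not (unordered(b, a) or b >= a),
    "UL":   lambda b, a: unordered(b, a) or b < a,
    "NUL":  lambda b, a: not (unordered(b, a) or b < a),
}

def f_set(pred: str, b: float, a: float) -> bool:
    """Evaluate one F.SET predicate on a pair of Python floats."""
    return PREDICATES[pred](b, a)
```

With this reading, NUGE is true exactly when b is less than a, and NUL exactly when b is greater than or equal to a, which matches their pairing with F.SET.L and F.SET.GE in the definition.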
Floating-point Ternary

These operations perform floating-point arithmetic on three floating-point
operands.



Operation codes

F.MULADD.16     Floating-point multiply and add half
F.MULADD.32     Floating-point multiply and add single
F.MULADD.64     Floating-point multiply and add double
F.MULADD.128    Floating-point multiply and add quad
F.MULSUB.16     Floating-point multiply and subtract half
F.MULSUB.32     Floating-point multiply and subtract single
F.MULSUB.64     Floating-point multiply and subtract double
F.MULSUB.128    Floating-point multiply and subtract quad





                         op        prec
multiply and add         MULADD    16 32 64 128
multiply and subtract    MULSUB    16 32 64 128



Format

F.operation.type rd=ra,rb,rc

 31      24 23   18 17   12 11    6 5     0
|    op    |  ra   |  rb   |  rc   |  rd   |
     8         6       6       6       6





Description

The contents of registers or register pairs specified by ra and rb are multiplied
together and added to or subtracted from the contents of the register or register
pair specified by rc. The result is rounded to the nearest representable
floating-point value in a single floating-point operation. The result is placed in
the register or register pair specified by rd. Floating-point exceptions are not
raised, and are handled according to the default rules of IEEE 754. These
instructions cannot select a directed rounding mode or trap on inexact.

If F128 precision is specified, ra, rb, rc and rd refer to an aligned pair of registers,
and a reserved instruction exception occurs if the low-order bit of these operands
is set.

Definition 

def FloatingPointTernary(op,ra,rb,rc,rd) as
    case op of
        F.MULADD.16, F.MULSUB.16:
            prec <- 16
        F.MULADD.32, F.MULSUB.32:
            prec <- 32
        F.MULADD.64, F.MULSUB.64:
            prec <- 64
        F.MULADD.128, F.MULSUB.128:
            prec <- 128
    endcase
    a <- F(prec, RegRead(ra, (prec<64) ? 64 : prec))
    b <- F(prec, RegRead(rb, (prec<64) ? 64 : prec))
    c <- F(prec, RegRead(rc, (prec<64) ? 64 : prec))
    case op of
        F.MULADD.16, F.MULADD.32, F.MULADD.64, F.MULADD.128:
            d <- a*b+c
        F.MULSUB.16, F.MULSUB.32, F.MULSUB.64, F.MULSUB.128:
            d <- a*b-c
    endcase
    RegWrite(rd, (prec<64) ? 64 : prec, PackF(prec,d))
enddef

Exceptions 

Reserved instruction 
Floating-point arithmetic 
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Floating -point Unarv 

These operations perform floating-point arithmetic on one floating-point operand.

Operation codes

F.ABS.16          Floating-point absolute value half
F.ABS.16.X        Floating-point absolute value half exact
F.ABS.32          Floating-point absolute value single
F.ABS.32.X        Floating-point absolute value single exact
F.ABS.64          Floating-point absolute value double
F.ABS.64.X        Floating-point absolute value double exact
F.ABS.128         Floating-point absolute value quad
F.ABS.128.X       Floating-point absolute value quad exact
F.DEFLATE.32      Floating-point convert half from single
F.DEFLATE.32.C    Floating-point convert half from single ceiling
F.DEFLATE.32.F    Floating-point convert half from single floor
F.DEFLATE.32.N    Floating-point convert half from single nearest
F.DEFLATE.32.T    Floating-point convert half from single truncate
F.DEFLATE.32.X    Floating-point convert half from single exact
F.DEFLATE.64      Floating-point convert single from double
F.DEFLATE.64.C    Floating-point convert single from double ceiling
F.DEFLATE.64.F    Floating-point convert single from double floor
F.DEFLATE.64.N    Floating-point convert single from double nearest
F.DEFLATE.64.T    Floating-point convert single from double truncate
F.DEFLATE.64.X    Floating-point convert single from double exact
F.DEFLATE.128     Floating-point convert double from quad
F.DEFLATE.128.C   Floating-point convert double from quad ceiling
F.DEFLATE.128.F   Floating-point convert double from quad floor
F.DEFLATE.128.N   Floating-point convert double from quad nearest
F.DEFLATE.128.T   Floating-point convert double from quad truncate
F.DEFLATE.128.X   Floating-point convert double from quad exact
F.FLOAT.16        Floating-point convert half from integer
F.FLOAT.16.C      Floating-point convert half from integer ceiling
F.FLOAT.16.F      Floating-point convert half from integer floor
F.FLOAT.16.N      Floating-point convert half from integer nearest
F.FLOAT.16.T      Floating-point convert half from integer truncate
F.FLOAT.16.X      Floating-point convert half from integer exact
F.FLOAT.32        Floating-point convert single from integer
F.FLOAT.32.C      Floating-point convert single from integer ceiling
F.FLOAT.32.F      Floating-point convert single from integer floor
F.FLOAT.32.N      Floating-point convert single from integer nearest
F.FLOAT.32.T      Floating-point convert single from integer truncate
F.FLOAT.32.X      Floating-point convert single from integer exact
F.FLOAT.64        Floating-point convert double from integer
F.FLOAT.64.C      Floating-point convert double from integer ceiling
F.FLOAT.64.F      Floating-point convert double from integer floor
F.FLOAT.64.N      Floating-point convert double from integer nearest
F.FLOAT.64.T      Floating-point convert double from integer truncate
F.FLOAT.64.X      Floating-point convert double from integer exact
F.FLOAT.128       Floating-point convert quad from integer
F.INFLATE.16      Floating-point convert single from half
F.INFLATE.16.X    Floating-point convert single from half exact
F.INFLATE.32      Floating-point convert double from single
F.INFLATE.32.X    Floating-point convert double from single exact
F.INFLATE.64      Floating-point convert quad from double
F.INFLATE.64.X    Floating-point convert quad from double exact
F.NEG.16          Floating-point negate half
F.NEG.16.X        Floating-point negate half exact
F.NEG.32          Floating-point negate single
F.NEG.32.X        Floating-point negate single exact
F.NEG.64          Floating-point negate double
F.NEG.64.X        Floating-point negate double exact
F.NEG.128         Floating-point negate quad
F.NEG.128.X       Floating-point negate quad exact
F.SINK.16         Floating-point convert integer from half
F.SINK.16.C       Floating-point convert integer from half ceiling
F.SINK.16.F       Floating-point convert integer from half floor
F.SINK.16.N       Floating-point convert integer from half nearest
F.SINK.16.T       Floating-point convert integer from half truncate
F.SINK.16.X       Floating-point convert integer from half exact
F.SINK.32         Floating-point convert integer from single
F.SINK.32.C       Floating-point convert integer from single ceiling
F.SINK.32.F       Floating-point convert integer from single floor
F.SINK.32.N       Floating-point convert integer from single nearest
F.SINK.32.T       Floating-point convert integer from single truncate
F.SINK.32.X       Floating-point convert integer from single exact
F.SINK.64         Floating-point convert integer from double
F.SINK.64.C       Floating-point convert integer from double ceiling
F.SINK.64.F       Floating-point convert integer from double floor
F.SINK.64.N       Floating-point convert integer from double nearest
F.SINK.64.T       Floating-point convert integer from double truncate
F.SINK.64.X       Floating-point convert integer from double exact
F.SINK.128        Floating-point convert integer from quad
F.SINK.128.C      Floating-point convert integer from quad ceiling
F.SINK.128.F      Floating-point convert integer from quad floor
F.SINK.128.N      Floating-point convert integer from quad nearest
F.SINK.128.T      Floating-point convert integer from quad truncate
F.SINK.128.X      Floating-point convert integer from quad exact
F.SQR.16          Floating-point square root half
F.SQR.16.C        Floating-point square root half ceiling
F.SQR.16.F        Floating-point square root half floor
F.SQR.16.N        Floating-point square root half nearest
F.SQR.16.T        Floating-point square root half truncate
F.SQR.16.X        Floating-point square root half exact
F.SQR.32          Floating-point square root single
F.SQR.32.C        Floating-point square root single ceiling
F.SQR.32.F        Floating-point square root single floor
F.SQR.32.N        Floating-point square root single nearest
F.SQR.32.T        Floating-point square root single truncate
F.SQR.32.X        Floating-point square root single exact
F.SQR.64          Floating-point square root double
F.SQR.64.C        Floating-point square root double ceiling
F.SQR.64.F        Floating-point square root double floor
F.SQR.64.N        Floating-point square root double nearest
F.SQR.64.T        Floating-point square root double truncate
F.SQR.64.X        Floating-point square root double exact
F.SQR.128         Floating-point square root quad
F.SQR.128.C       Floating-point square root quad ceiling
F.SQR.128.F       Floating-point square root quad floor
F.SQR.128.N       Floating-point square root quad nearest
F.SQR.128.T       Floating-point square root quad truncate
F.SQR.128.X       Floating-point square root quad exact





                             op         prec            round/trap
absolute value               ABS        16 32 64 128    NONE X
float from integer           FLOAT      16 32 64        NONE C F N T X
                                        128             NONE
integer from float           SINK       16 32 64 128    NONE C F N T X
increase format precision    INFLATE    16 32 64        NONE X
decrease format precision    DEFLATE    32 64 128       NONE C F N T X
negate                       NEG        16 32 64 128    NONE X
square root                  SQR        16 32 64 128    NONE C F N T X

Format

F.op.prec.round rc=ra

 31     24 23    18 17    12 11     6 5           0
| F.prec  |        |        |        | UNARY.round |
     8        6        6        6          6


Description

The contents of the register or register pair specified by ra are used as the operand
of the specified floating-point operation. The operation is rounded using the
specified rounding option, or using round-to-nearest if none is specified. The result is
placed in the register or register pair specified by rc.

If a rounding option is specified, the operation raises a floating-point exception if a
floating-point invalid operation, divide by zero, overflow, or underflow occurs, or,
when specified, if the result is inexact. If a rounding option is not specified,
floating-point exceptions are not raised, and are handled according to the default
rules of IEEE 754.

If F128 precision is specified, ra or rb or both refer to an aligned pair of registers,
and a reserved instruction exception occurs if the low-order bit of these operands
is set.
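
As an illustration of the rounding option codes, the following Python sketch models an F.SINK-style conversion from floating point to integer. It is only a sketch under stated assumptions: the function name `sink` is hypothetical, Python's `ArithmeticError` stands in for the floating-point exception, and the X option is modeled as raising whenever the conversion is inexact.

```python
import math

def sink(value: float, mode: str = "NONE") -> int:
    """Convert a float to an integer using a rounding option code (hypothetical)."""
    if mode in ("NONE", "N"):
        return round(value)       # round half to even, like IEEE 754 nearest
    if mode == "C":
        return math.ceil(value)   # ceiling: toward plus infinity
    if mode == "F":
        return math.floor(value)  # floor: toward minus infinity
    if mode == "T":
        return math.trunc(value)  # truncate: toward zero
    if mode == "X":
        if value != math.floor(value):
            raise ArithmeticError("inexact result")  # stand-in for the trap
        return int(value)
    raise ValueError(f"unknown rounding option: {mode}")
```

For example, `sink(2.5)` gives 2 under round-to-nearest-even, while `sink(2.5, "C")` gives 3 and `sink(-2.5, "F")` gives -3.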

Definition

def FloatingPointUnary(op,prec,round,ra,rb,rc) as
     if op = F.FLOAT then
          a <- RegRead(ra, 64)
     else
          a <- F(prec, RegRead(ra, (prec<64) ? 64 : prec))
     endif
     case op of
          F.ABS:
               if a < 0 then
                    c <- -a
               else
                    c <- a
               endif
          F.NEG:
               c <- -a
          F.SQR:
               c <- sqrt(a)
          F.FLOAT, F.SINK, F.INFLATE, F.DEFLATE:
               c <- a
     endcase
     case op of
          F.ABS, F.NEG, F.SQR, F.FLOAT:
               destprec <- prec
          F.SINK:
               destprec <- INT
          F.INFLATE:
               destprec <- prec + prec
          F.DEFLATE:
               destprec <- prec / 2
     endcase
     case round of
          X:
          N:
          T:
          F:
          C:
          NONE:
     endcase
     case destprec of
          16, 32, 64, 128:
               RegWrite(rc, (destprec<64) ? 64 : destprec, PackF(destprec, c))
          INT:
               RegWrite(rc, 64, c)
     endcase
enddef

Exceptions

Reserved instruction
Floating-point arithmetic

Group

These instructions take two operands, perform a group of operations on partitions
of bits in the operands, and catenate the results together.

Operation codes

G.ADD.2            Group add pecks
G.ADD.4            Group add nibbles
G.ADD.8            Group add bytes
G.ADD.16           Group add doublets
G.ADD.32           Group add quadlets
G.ADD.64           Group add octlets
G.AND [14]         Group and
G.ANDN [15]        Group and not
G.COMPRESS.1       Group compress bits
G.COMPRESS.2       Group compress pecks
G.COMPRESS.4       Group compress nibbles
G.COMPRESS.8       Group compress bytes
G.COMPRESS.16      Group compress doublets
G.COMPRESS.32      Group compress quadlets
G.COMPRESS.64      Group compress octlets
G.DIV.64           Group signed divide octlets
G.EXPAND.1         Group signed expand bits
G.EXPAND.2         Group signed expand pecks
G.EXPAND.4         Group signed expand nibbles
G.EXPAND.8         Group signed expand bytes
G.EXPAND.16        Group signed expand doublets
G.EXPAND.32        Group signed expand quadlets
G.EXPAND.64        Group signed expand octlet
G.GATHER.2         Group gather pecks
G.GATHER.4         Group gather nibbles
G.GATHER.8         Group gather bytes
G.GATHER.16        Group gather doublets
G.GATHER.32        Group gather quadlets
G.GATHER.64        Group gather octlets
G.GATHER.128 [16]  Group gather hexlets
G.MUL.1 [17]       Group signed multiply bits
G.MUL.2            Group signed multiply pecks
G.MUL.4            Group signed multiply nibbles
G.MUL.8            Group signed multiply bytes
G.MUL.16           Group signed multiply doublets
G.MUL.32           Group signed multiply quadlets
G.MUL.64           Group signed multiply octlets
G.NAND [18]        Group nand
G.NOR [19]         Group nor
G.OR [20]          Group or
G.ORN [21]         Group or not
G.POLY.1           Group polynomial divide bits
G.POLY.2           Group polynomial divide pecks
G.POLY.4           Group polynomial divide nibbles
G.POLY.8           Group polynomial divide bytes
G.POLY.16          Group polynomial divide doublets
G.POLY.32          Group polynomial divide quadlets
G.POLY.64          Group polynomial divide octlets
G.ROTL.2           Group rotate left pecks
G.ROTL.4           Group rotate left nibbles
G.ROTL.8           Group rotate left bytes
G.ROTL.16          Group rotate left doublets
G.ROTL.32          Group rotate left quadlets
G.ROTL.64          Group rotate left octlets
G.ROTL.128         Group rotate left hexlets
G.ROTR.2           Group rotate right pecks
G.ROTR.4           Group rotate right nibbles
G.ROTR.8           Group rotate right bytes
G.ROTR.16          Group rotate right doublets
G.ROTR.32          Group rotate right quadlets
G.ROTR.64          Group rotate right octlets
G.ROTR.128         Group rotate right hexlets
G.SCATTER.2        Group scatter pecks
G.SCATTER.4        Group scatter nibbles
G.SCATTER.8        Group scatter bytes
G.SCATTER.16       Group scatter doublets
G.SCATTER.32       Group scatter quadlets
G.SCATTER.64       Group scatter octlets
G.SCATTER.128 [22] Group scatter hexlet
G.SHL.2            Group shift left pecks
G.SHL.4            Group shift left nibbles
G.SHL.8            Group shift left bytes
G.SHL.16           Group shift left doublets
G.SHL.32           Group shift left quadlets
G.SHL.64           Group shift left octlets
G.SHL.128          Group shift left hexlets
G.SHR.2            Group signed shift right pecks
G.SHR.4            Group signed shift right nibbles
G.SHR.8            Group signed shift right bytes
G.SHR.16           Group signed shift right doublets
G.SHR.32           Group signed shift right quadlets
G.SHR.64           Group signed shift right octlets
G.SHR.128          Group signed shift right hexlets
G.U.DIV.64         Group unsigned divide octlets
G.U.EXPAND.1       Group unsigned expand bits
G.U.EXPAND.2       Group unsigned expand pecks
G.U.EXPAND.4       Group unsigned expand nibbles
G.U.EXPAND.8       Group unsigned expand bytes
G.U.EXPAND.16      Group unsigned expand doublets
G.U.EXPAND.32      Group unsigned expand quadlets
G.U.EXPAND.64      Group unsigned expand octlet
G.U.MUL.2          Group unsigned multiply pecks
G.U.MUL.4          Group unsigned multiply nibbles
G.U.MUL.8          Group unsigned multiply bytes
G.U.MUL.16         Group unsigned multiply doublets
G.U.MUL.32         Group unsigned multiply quadlets
G.U.MUL.64         Group unsigned multiply octlets
G.U.SHR.2          Group unsigned shift right pecks
G.U.SHR.4          Group unsigned shift right nibbles
G.U.SHR.8          Group unsigned shift right bytes
G.U.SHR.16         Group unsigned shift right doublets
G.U.SHR.32         Group unsigned shift right quadlets
G.U.SHR.64         Group unsigned shift right octlets
G.U.SHR.128        Group unsigned shift right hexlets
G.XNOR [23]        Group exclusive-nor
G.XOR [24]         Group exclusive-or

[14] G.AND does not require a size specification, and is encoded as G.AND.1.
[15] G.ANDN does not require a size specification, and is encoded as G.ANDN.1. G.ANDN is
     used as the encoding for G.SET.L.1, and by reversing the operands, for G.SET.UL.1.
[16] G.GATHER.128 is encoded as G.GATHER.1.
[17] G.MUL.1 is used as the encoding for G.UMUL.1.
[18] G.NAND does not require a size specification, and is encoded as G.NAND.1.
[19] G.NOR does not require a size specification, and is encoded as G.NOR.1.
[20] G.OR does not require a size specification, and is encoded as G.OR.1.
[21] G.ORN does not require a size specification, and is encoded as G.ORN.1. G.ORN is used as
     the encoding for G.SET.UGE.1, and by reversing the operands, for G.SET.GE.1.
[22] G.SCATTER.128 is encoded as G.SCATTER.1.
[23] G.XNOR does not require a size specification, and is encoded as G.XNOR.1. G.XNOR is
     used as the encoding for G.SET.E.1.
[24] G.XOR does not require a size specification, and is encoded as G.XOR.1. G.XOR is used as
     the encoding for G.ADD.1, G.SUB.1 and G.SET.NE.1.

class               op                                         size
linear              ADD                                        2 4 8 16 32 64
bitwise             AND ANDN NAND NOR OR ORN XNOR XOR
signed multiply     MUL                                        1 2 4 8 16 32 64
unsigned multiply   U.MUL                                      2 4 8 16 32 64
signed divide       DIV                                        64
unsigned divide     U.DIV                                      64
                    GATHER SCATTER                             2 4 8 16 32 64
galois field        POLY                                       1 2 4 8 16 32 64
precision           COMPRESS EXPAND U.EXPAND                   1 2 4 8 16 32 64
shift               ROTR ROTL SHR SHL U.SHR                    2 4 8 16 32 64 128


Format

G.op.size rc=ra,rb

 31     24 23    18 17    12 11     6 5      0
| G.size  |   ra   |   rb   |   rc   |   op   |
     8        6        6        6        6

Description

Two values are taken from the contents of registers or register pairs specified by
ra and rb. The specified operation is performed, and the result is placed in the
register or register pair specified by rc.

A reserved instruction exception occurs if the low-order bit of rc is set, and for
certain operations, if the low-order bit of ra or rb is set.
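
A group operation treats a 128-bit value as independent partitions of `size` bits. The Python sketch below models G.ADD on plain integers; the function name `group_add` is hypothetical, and the register file and exception checks are left out.

```python
MASK128 = (1 << 128) - 1

def group_add(a: int, b: int, size: int) -> int:
    """Add two 128-bit values partition by partition (hypothetical helper)."""
    lane_mask = (1 << size) - 1
    result = 0
    for i in range(0, 128, size):
        lane = ((a >> i) & lane_mask) + ((b >> i) & lane_mask)
        result |= (lane & lane_mask) << i  # the carry out of a partition is dropped
    return result & MASK128
```

With 8-bit partitions, `group_add(0x01FF, 0x0101, 8)` yields `0x0200`: the low byte wraps to zero without carrying into its neighbour, whereas 16-bit partitions give `0x0300`.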

Definition

def Group(op,size,ra,rb,rc)
     case op of
          G.MUL, G.U.MUL, G.DIV, G.U.DIV:
               a <- RegRead(ra, 64)
               b <- RegRead(rb, 64)
          G.ADD, G.SUB, G.SET.L, G.SET.UL, G.SET.E, G.SET.NE, G.SET.GE, G.SET.UGE,
          G.AND, G.OR, G.XOR, G.ANDN, G.NAND, G.NOR, G.XNOR, G.ORN,
          G.GATHER, G.SCATTER:
               a <- RegRead(ra, 128)
               b <- RegRead(rb, 128)
          G.COMPRESS, G.ROTL, G.ROTR, G.SHL, G.SHR, G.U.SHR, G.POLY:
               a <- RegRead(ra, 128)
               b <- RegRead(rb, 64)
          G.EXPAND, G.U.EXPAND:
               a <- RegRead(ra, 64)
               b <- RegRead(rb, 64)
     endcase
     case op of
          G.ADD:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- a[i+size-1..i] + b[i+size-1..i]
               endfor
          G.MUL:
               for i <- 0 to 64-size by size
                    c[2*(i+size)-1..2*i] <- (a[size-1+i]^size || a[size-1+i..i]) * (b[size-1+i]^size || b[size-1+i..i])
               endfor
          G.U.MUL:
               for i <- 0 to 64-size by size
                    c[2*(i+size)-1..2*i] <- (0^size || a[size-1+i..i]) * (0^size || b[size-1+i..i])
               endfor
          G.DIV:
               if (b = 0) or ( (a = (1 || 0^63)) and (b = 1^64) ) then
                    c <- undefined
               else
                    q <- a / b
                    r <- a - q*b
                    c <- r[63..0] || q[63..0]
               endif
          G.U.DIV:
               if b = 0 then
                    c <- undefined
               else
                    q <- (0 || a) / (0 || b)
                    r <- a - q*b
                    c <- r[63..0] || q[63..0]
               endif
          G.AND:
               c <- a and b
          G.OR:
               c <- a or b
          G.XOR:
               c <- a xor b
          G.ANDN:
               c <- a and not b
          G.NAND:
               c <- not (a and b)
          G.NOR:
               c <- not (a or b)
          G.XNOR:
               c <- not (a xor b)
          G.ORN:
               c <- a or not b
          G.POLY:
               p[0] <- a
               for i <- 1 to size
                    p[i] <- (p[i-1][0] ? (0^64 || b) : 0^128) xor (p[i-1][0] || p[i-1][127..1])
               endfor
               c <- p[size]
          G.GATHER:
               for k <- 0 to 128-size by size
                    j <- k
                    for i <- k to k+size-1 by 1
                         if a[i] then
                              c[j] <- b[i]
                              j <- j + 1
                         endif
                    endfor
                    j <- k+size-1
                    for i <- k+size-1 to k by -1
                         if not a[i] then
                              c[j] <- b[i]
                              j <- j - 1
                         endif
                    endfor
               endfor
          G.SCATTER:
               for k <- 0 to 128-size by size
                    j <- k
                    for i <- k to k+size-1 by 1
                         if a[i] then
                              c[i] <- b[j]
                              j <- j + 1
                         endif
                    endfor
                    j <- k+size-1
                    for i <- k+size-1 to k by -1
                         if not a[i] then
                              c[i] <- b[j]
                              j <- j - 1
                         endif
                    endfor
               endfor
          G.COMPRESS:
               for i <- 0 to 64-size by size
                    c[i+size-1..i] <- a[i+i+size-1+(b&(size-1))..i+i+(b&(size-1))]
               endfor
          G.EXPAND:
               for i <- 0 to 64-size by size
                    c[i+i+size+size-1..i+i] <- a[i+size-1]^(size-(b&(size-1))) || a[i+size-1..i] || 0^(b&(size-1))
               endfor
          G.U.EXPAND:
               for i <- 0 to 64-size by size
                    c[i+i+size+size-1..i+i] <- 0^(size-(b&(size-1))) || a[i+size-1..i] || 0^(b&(size-1))
               endfor
          G.ROTL:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- a[i+size-1-(b&(size-1))..i] || a[i+size-1..i+size-(b&(size-1))]
               endfor
          G.ROTR:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- a[i+(b&(size-1))-1..i] || a[i+size-1..i+(b&(size-1))]
               endfor
          G.SHL:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- a[i+size-1-(b&(size-1))..i] || 0^(b&(size-1))
               endfor
          G.SHR:



               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- a[i+size-1]^(b&(size-1)) || a[i+size-1..i+(b&(size-1))]
               endfor
          G.U.SHR:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- 0^(b&(size-1)) || a[i+size-1..i+(b&(size-1))]
               endfor
     endcase
     case op of
          G.ADD, G.MUL, G.UMUL, G.DIV, G.UDIV,
          G.AND, G.OR, G.XOR, G.ANDN, G.NAND, G.NOR, G.XNOR, G.ORN,
          G.EXPAND, G.U.EXPAND, G.SHL, G.SHR, G.U.SHR,
          G.GATHER, G.SCATTER, G.POLY:
               RegWrite(rc, 128, c)
          G.COMPRESS:
               RegWrite(rc, 64, c)
     endcase
enddef
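
The difference between G.SHR and G.U.SHR lies only in what fills the vacated high bits of each partition: copies of the sign bit, or zeros. The following Python sketch shows both on plain integers; `group_shr` is a hypothetical helper, and the shift count is masked with `size-1` as in the definition above.

```python
def group_shr(a: int, b: int, size: int, signed: bool) -> int:
    """Shift every size-bit partition of a 128-bit value right (hypothetical helper)."""
    lane_mask = (1 << size) - 1
    shift = b & (size - 1)            # only the low bits of b select the shift
    result = 0
    for i in range(0, 128, size):
        lane = (a >> i) & lane_mask
        if signed and lane >> (size - 1):
            lane -= 1 << size         # reinterpret the partition as negative
        # Python's >> on a negative int is an arithmetic (sign-filling) shift.
        result |= ((lane >> shift) & lane_mask) << i
    return result
```

For `a = 0x80F0` with 8-bit partitions and a shift of 4, the unsigned form gives `0x080F` and the signed form gives `0xF8FF`.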

Exceptions
Reserved Instruction

Group Extract Immediate

These operations perform calculations with two general register pair values and a 
small immediate field, placing the result in a third general register pair. 



Operation codes

G.EXTRACT.I.1       Group signed extract immediate bits
G.EXTRACT.I.2       Group signed extract immediate pecks
G.EXTRACT.I.4       Group signed extract immediate nibbles
G.EXTRACT.I.8       Group signed extract immediate bytes
G.EXTRACT.I.16      Group signed extract immediate doublets
G.EXTRACT.I.32      Group signed extract immediate quadlets
G.EXTRACT.I.64      Group signed extract immediate octlets
G.EXTRACT.I.128     Group signed extract immediate hexlet
G.UEXTRACT.I.1      Group unsigned extract immediate bits
G.UEXTRACT.I.2      Group unsigned extract immediate pecks
G.UEXTRACT.I.4      Group unsigned extract immediate nibbles
G.UEXTRACT.I.8      Group unsigned extract immediate bytes
G.UEXTRACT.I.16     Group unsigned extract immediate doublets
G.UEXTRACT.I.32     Group unsigned extract immediate quadlets
G.UEXTRACT.I.64     Group unsigned extract immediate octlets
G.UEXTRACT.I.128    Group unsigned extract immediate hexlet



Format

G.EXTRACT.I.size rc=ra,rb,shift

 31     24 23    18 17    12 11     6 5       0
|   op    |   ra   |   rb   |   rc   | ishifta |
     8        6        6        6        6

Description

The contents of the register pairs specified by ra and rb are fetched. The specified
operation is performed on these operands. The result is placed into the register
pair specified by rc.

Definition

def GroupExtractImmediate(op,ra,rb,rc,ishifta) as
     a <- RegRead(ra, 128)
     b <- RegRead(rb, 128)
     ab <- a || b
     opimm <- op[2..0] || ishifta
     case opimm of
          0..2:
               raise ReservedInstruction
          3..5:
               size <- 1
          6..11:
               size <- 2
          12..23:
               size <- 4
          24..47:
               size <- 8
          48..95:
               size <- 16
          96..191:
               size <- 32
          192..383:
               size <- 64
          384..511:
               size <- 128
     endcase
     shift <- opimm[8..0] & (size+size-1)
     if shift > size then
          sex <- (opimm & (size+size)) != 0
          if sex then
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- ab[i+i+size+size-1]^(shift-size) || ab[i+i+size+size-1..i+i+shift]
               endfor
          else
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- 0^(shift-size) || ab[i+i+size+size-1..i+i+shift]
               endfor
          endif
     else
          for i <- 0 to 128-size by size
               c[i+size-1..i] <- ab[i+i+shift+size-1..i+i+shift]
          endfor
     endif
     RegWrite(rc, 128, c)
enddef
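
The case table that maps the 9-bit opimm value to a partition size follows a regular pattern: each size s covers the values 3s through 6s-1, and size 128 runs to 511. A small Python sketch of that decoding follows; the name `extract_size` is hypothetical, and a `ValueError` stands in for the reserved instruction exception.

```python
def extract_size(opimm: int) -> int:
    """Map the 9-bit opimm value to a partition size, per the case table."""
    if not 0 <= opimm <= 511:
        raise ValueError("opimm is a 9-bit field")
    if opimm <= 2:
        raise ValueError("reserved instruction")
    size = 1
    # Size s covers opimm values 3*s .. 6*s-1; size 128 extends to 511.
    while size < 128 and opimm >= 6 * size:
        size *= 2
    return size
```

For example, `extract_size(47)` returns 8 and `extract_size(48)` returns 16, matching the boundaries in the table.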

Exceptions
Reserved instruction

Group Field Immediate

These operations perform calculations with one or two general register values and 
two immediate values, placing the result in a general register. 



Operation codes

G.DEP.I     Group deposit immediate
G.MDEP.I    Group merge deposit immediate
G.UDEP.I    Group unsigned deposit immediate
G.UWTH.I    Group unsigned withdraw immediate
G.WTH.I     Group withdraw immediate



Format

op.size rb=ra,ishift,isize

 31     24 23    18 17    12 11      6 5       0
|   op    |   ra   |   rb   | ishifta | isizea  |
     8        6        6        6         6



Description

The contents of register ra and, if specified, the contents of register rb are fetched,
and 6-bit immediate values are taken from the 6-bit ishifta and isizea fields. The
specified operation is performed on these operands. The result is placed into
register rb.

Definition

def GroupFieldImmediate(op,ra,rb,ishifta,isizea) as
     a <- RegRead(ra, 64)
     case (ishifta & isizea) of
          0..31:
               size <- 64
          32..47:
               size <- 32
          48..55:
               size <- 16
          56..59:
               size <- 8
          60..61:
               size <- 4
          62:
               size <- 2
          63:
               size <- 1
     endcase
     ishift <- ishifta & (size-1)
     isize <- (isizea & (size-1))+1
     if (ishift+isize > size)
          raise ReservedInstruction
     endif
     case op of
          G.DEPI:
               for i <- 0 to 128-size by size
                    b[i+size-1..i] <- a[i+isize-1]^(size-isize-ishift) || a[i+isize-1..i] || 0^ishift
               endfor
          G.UDEPI:
               for i <- 0 to 128-size by size
                    b[i+size-1..i] <- 0^(size-isize-ishift) || a[i+isize-1..i] || 0^ishift
               endfor
          G.MDEPI:
               m <- RegRead(rb, 128)
               for i <- 0 to 128-size by size
                    b[i+size-1..i] <- m[i+size-1..i+isize+ishift] || a[i+isize-1..i] || m[i+ishift-1..i]
               endfor
          G.WTHI:
               for i <- 0 to 128-size by size
                    b[i+size-1..i] <- a[i+isize+ishift-1]^(size-isize) || a[i+isize+ishift-1..i+ishift]
               endfor
          G.UWTHI:
               for i <- 0 to 128-size by size
                    b[i+size-1..i] <- 0^(size-isize) || a[i+isize+ishift-1..i+ishift]
               endfor
     endcase
     RegWrite(rb, 128, b)
enddef
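
The size selection in this definition counts the leading one bits of `ishifta & isizea`: each leading 1 halves the partition size, starting from 64. The Python sketch below decodes the fields and performs an unsigned deposit in the manner of G.UDEP.I. The helper names are hypothetical, a `ValueError` stands in for the reserved instruction exception, and the operand is assumed to be a full 128-bit value.

```python
def field_size(ishifta: int, isizea: int) -> int:
    """Partition size selected by the two 6-bit immediate fields."""
    code = ishifta & isizea
    size, bit = 64, 32
    while bit and code & bit:   # each leading 1 bit halves the size
        size //= 2
        bit >>= 1
    return size

def group_udep(a: int, ishifta: int, isizea: int) -> int:
    """Unsigned deposit: place the low isize bits of each partition at ishift."""
    size = field_size(ishifta, isizea)
    ishift = ishifta & (size - 1)
    isize = (isizea & (size - 1)) + 1
    if ishift + isize > size:
        raise ValueError("reserved instruction")
    field_mask = (1 << isize) - 1
    result = 0
    for i in range(0, 128, size):
        field = (a >> i) & field_mask
        result |= (field << ishift) << i
    return result
```

As a usage example, `ishifta = 52` and `isizea = 55` select 16-bit partitions with a shift of 4 and a field width of 8, so `group_udep(0x1234ABCD, 52, 55)` returns `0x03400CD0`.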

Exceptions
Reserved instruction

Group Inplace

These operations perform calculations with three general register values, placing
the result in the third general register.



Operation codes

G.MSHR.2      Group merge shift right pecks
G.MSHR.4      Group merge shift right nibbles
G.MSHR.8      Group merge shift right bytes
G.MSHR.16     Group merge shift right doublets
G.MSHR.32     Group merge shift right quadlets
G.MSHR.64     Group merge shift right octlets
G.MSHR.128    Group merge shift right hexlets



Format

G.MSHR.size rc=ra,rb,rc

 31     24 23    18 17    12 11     6 5      0
| G.size  |   ra   |   rb   |   rc   |   op   |
     8        6        6        6        6

Description

The contents of the register pairs specified by ra and rc and of register rb are fetched.
The specified operation is performed on these operands. The result is placed into
the register pair specified by rc.

A reserved instruction exception occurs if the low-order bit of ra or rc is set.

Definition

def GroupTernaryInplace(op,ra,rb,rc) as
     a <- RegRead(ra, 128)
     b <- RegRead(rb, 64)
     c <- RegRead(rc, 128)
     case op of
          G.MSHR:
               for i <- 0 to 128-size by size
                    d[i+size-1..i] <- c[i+size-1..i+size-(b&(size-1))] || a[i+size-1..i+(b&(size-1))]
               endfor
     endcase
     RegWrite(rc, 128, d)
enddef

Exceptions
Reserved Instruction

Group Reversed

These operations take two values from a pair of registers, perform operations on 
groups of bits in the operands, and place the concatenated results in a register. 



Operation codes

G.SET.E.2       Group set equal pecks
G.SET.E.4       Group set equal nibbles
G.SET.E.8       Group set equal bytes
G.SET.E.16      Group set equal doublets
G.SET.E.32      Group set equal quadlets
G.SET.E.64      Group set equal octlets
G.SET.GE.2      Group set signed greater or equal pecks
G.SET.GE.4      Group set signed greater or equal nibbles
G.SET.GE.8      Group set signed greater or equal bytes
G.SET.GE.16     Group set signed greater or equal doublets
G.SET.GE.32     Group set signed greater or equal quadlets
G.SET.GE.64     Group set signed greater or equal octlets
G.SET.L.2       Group set signed less pecks
G.SET.L.4       Group set signed less nibbles
G.SET.L.8       Group set signed less bytes
G.SET.L.16      Group set signed less doublets
G.SET.L.32      Group set signed less quadlets
G.SET.L.64      Group set signed less octlets
G.SET.NE.2      Group set not equal pecks
G.SET.NE.4      Group set not equal nibbles
G.SET.NE.8      Group set not equal bytes
G.SET.NE.16     Group set not equal doublets
G.SET.NE.32     Group set not equal quadlets
G.SET.NE.64     Group set not equal octlets
G.SET.UGE.2     Group set unsigned greater or equal pecks
G.SET.UGE.4     Group set unsigned greater or equal nibbles
G.SET.UGE.8     Group set unsigned greater or equal bytes
G.SET.UGE.16    Group set unsigned greater or equal doublets
G.SET.UGE.32    Group set unsigned greater or equal quadlets
G.SET.UGE.64    Group set unsigned greater or equal octlets
G.SET.UL.2      Group set unsigned less pecks
G.SET.UL.4      Group set unsigned less nibbles
G.SET.UL.8      Group set unsigned less bytes
G.SET.UL.16     Group set unsigned less doublets
G.SET.UL.32     Group set unsigned less quadlets
G.SET.UL.64     Group set unsigned less octlets
G.SUB.2         Group subtract pecks
G.SUB.4         Group subtract nibbles
G.SUB.8         Group subtract bytes
G.SUB.16        Group subtract doublets
G.SUB.32        Group subtract quadlets
G.SUB.64        Group subtract octlets

class      op                                          size
linear     SUB                                         2 4 8 16 32 64
boolean    SET.E SET.NE SET.L SET.UL SET.GE SET.UGE    2 4 8 16 32 64



Format

G.op.size rc=rb,ra

 31     24 23    18 17    12 11     6 5      0
| G.size  |   ra   |   rb   |   rc   |   op   |
     8        6        6        6        6


Description

Two values are taken from the contents of registers ra and rb. The specified
operation is performed, and the result is placed in register rc.

Definition

def GroupReversed(op,size,ra,rb,rc)
     a <- RegRead(ra, 128)
     b <- RegRead(rb, 128)
     case op of
          G.SUB:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- b[i+size-1..i] - a[i+size-1..i]
               endfor
          G.SET.L:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- (b[i+size-1..i] < a[i+size-1..i])^size
               endfor
          G.SET.UL:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- ((0 || b[i+size-1..i]) < (0 || a[i+size-1..i]))^size
               endfor
          G.SET.E:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- (b[i+size-1..i] = a[i+size-1..i])^size
               endfor
          G.SET.NE:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- (b[i+size-1..i] != a[i+size-1..i])^size
               endfor
          G.SET.GE:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- (b[i+size-1..i] >= a[i+size-1..i])^size
               endfor
          G.SET.UGE:
               for i <- 0 to 128-size by size
                    c[i+size-1..i] <- ((0 || b[i+size-1..i]) >= (0 || a[i+size-1..i]))^size
               endfor
     endcase
     RegWrite(rc, 128, c)
enddef
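
Each G.SET operation compares partitions and replicates the one-bit outcome across the whole partition, which yields a mask of all ones or all zeros per partition. The Python sketch below models this with the operand order used in the definition (b compared against a). The function name `group_set` and the string condition codes are hypothetical conveniences.

```python
def group_set(cond: str, a: int, b: int, size: int) -> int:
    """Per-partition compare of b against a, producing all-ones or all-zeros masks."""
    lane_mask = (1 << size) - 1
    sign_bit = 1 << (size - 1)

    def signed(x: int) -> int:
        return x - (1 << size) if x & sign_bit else x

    tests = {
        "E": lambda lb, la: lb == la,
        "NE": lambda lb, la: lb != la,
        "L": lambda lb, la: signed(lb) < signed(la),
        "UL": lambda lb, la: lb < la,
        "GE": lambda lb, la: signed(lb) >= signed(la),
        "UGE": lambda lb, la: lb >= la,
    }
    test = tests[cond]
    result = 0
    for i in range(0, 128, size):
        la = (a >> i) & lane_mask
        lb = (b >> i) & lane_mask
        if test(lb, la):
            result |= lane_mask << i   # replicate the true bit across the partition
    return result
```

With `a = 0x0180`, `b = 0x0201` and 8-bit partitions, the low byte of b (1) is below the low byte of a (0x80) only when the values are treated as unsigned, so the "UL" result is `0xFF` and the "L" result is 0.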

Exceptions
Reserved Instruction

Group Short Immediate

These operations take operands from a pair of registers, perform operations on
groups of bits in the operands, and place the concatenated results in a register or
pair of registers.



Oceraticr cedes 



G.COMPRESS.I.1      Group compress immediate bits
G.COMPRESS.I.2      Group compress immediate pecks
G.COMPRESS.I.4      Group compress immediate nibbles
G.COMPRESS.I.8      Group compress immediate bytes
G.COMPRESS.I.16     Group compress immediate doublets
G.COMPRESS.I.32     Group compress immediate quadlets
G.COMPRESS.I.64     Group compress immediate octlet
G.EXPAND.I.1        Group signed expand immediate bits
G.EXPAND.I.2        Group signed expand immediate pecks
G.EXPAND.I.4        Group signed expand immediate nibbles
G.EXPAND.I.8        Group signed expand immediate bytes
G.EXPAND.I.16       Group signed expand immediate doublets
G.EXPAND.I.32       Group signed expand immediate quadlets
G.EXPAND.I.64       Group signed expand immediate octlet
G.ROTR.I.2          Group rotate right immediate pecks
G.ROTR.I.4          Group rotate right immediate nibbles
G.ROTR.I.8          Group rotate right immediate bytes
G.ROTR.I.16         Group rotate right immediate doublets
G.ROTR.I.32         Group rotate right immediate quadlets
G.ROTR.I.64         Group rotate right immediate octlets
G.ROTR.I.128        Group rotate right immediate hexlets
G.SHL.I.2           Group shift left immediate pecks
G.SHL.I.4           Group shift left immediate nibbles
G.SHL.I.8           Group shift left immediate bytes
G.SHL.I.16          Group shift left immediate doublets
G.SHL.I.32          Group shift left immediate quadlets
G.SHL.I.64          Group shift left immediate octlets
G.SHL.I.128         Group shift left immediate hexlets
G.SHR.I.2           Group signed shift right immediate pecks
G.SHR.I.4           Group signed shift right immediate nibbles
G.SHR.I.8           Group signed shift right immediate bytes
G.SHR.I.16          Group signed shift right immediate doublets
G.SHR.I.32          Group signed shift right immediate quadlets
G.SHR.I.64          Group signed shift right immediate octlets
G.SHR.I.128         Group signed shift right immediate hexlets
G.SHUFFLE.I         Group shuffle immediate
G.SHUFFLE.I.4MUX    Group shuffle immediate and 4-way multiplex
G.U.EXPAND.I.1      Group unsigned expand immediate bits
G.U.EXPAND.I.2      Group unsigned expand immediate pecks
G.U.EXPAND.I.4      Group unsigned expand immediate nibbles
G.U.EXPAND.I.8      Group unsigned expand immediate bytes
G.U.EXPAND.I.16     Group unsigned expand immediate doublets
G.U.EXPAND.I.32     Group unsigned expand immediate quadlets
G.U.EXPAND.I.64     Group unsigned expand immediate octlet
G.U.SHR.I.2         Group unsigned shift right immediate pecks
G.U.SHR.I.4         Group unsigned shift right immediate nibbles
G.U.SHR.I.8         Group unsigned shift right immediate bytes
G.U.SHR.I.16        Group unsigned shift right immediate doublets
G.U.SHR.I.32        Group unsigned shift right immediate quadlets
G.U.SHR.I.64        Group unsigned shift right immediate octlets
G.U.SHR.I.128       Group unsigned shift right immediate hexlets


class        op                                  size
precision    COMPRESS.I  EXPAND.I  U.EXPAND.I    1 2 4 8 16 32 64
shift        ROTR.I  SHR.I  SHL.I  U.SHR.I       2 4 8 16 32 64 128

Format

G.op.size rb=ra,simm

 31     24 23   18 17   12 11    6 5     0
|  G.size |   ra  |   rb  |  simm |  op   |
     8        6       6       6       6



Description

A 128-bit value is taken from the contents of the register pair specified by ra. The
second operand is taken from simm. The specified operation is performed, and the
result is placed in the register pair specified by rb.

This instruction is undefined and causes a reserved instruction exception if the
simm field is greater or equal to the size specified.

Definition 

def GroupShortImmediate(op,size,ra,rb,simm)
    case op of
        G.COMPRESS.I, G.U.COMPRESS.I:
            a ← RegRead(ra, 128)
            if simm>size then
                raise ReservedInstruction
            endif
        G.ROTR.I, G.SHL.I, G.SHR.I, G.U.SHR.I:
            a ← RegRead(ra, 128)
            shift ← op_0 || simm
            if shift>size then
                raise ReservedInstruction
            endif
        G.EXPAND.I, G.U.EXPAND.I:
            a ← RegRead(ra, 64)
            shift ← op_0 || simm
            if shift>size+size then
                raise ReservedInstruction
            endif
    endcase
    case op of
        G.COMPRESS.I:
            for i ← 0 to 64-size by size
                b_{i+size-1..i} ← a_{i+i+size-1+simm..i+i+simm}
            endfor
        G.EXPAND.I:
            for i ← 0 to 64-size by size
                b_{i+i+size+size-1..i+i} ← a_{i+size-1}^{size-shift} || a_{i+size-1..i} || 0^{shift}
            endfor
        G.U.EXPAND.I:
            for i ← 0 to 64-size by size
                b_{i+i+size+size-1..i+i} ← 0^{size-shift} || a_{i+size-1..i} || 0^{shift}
            endfor
        G.SHL.I:
            for i ← 0 to 128-size by size
                b_{i+size-1..i} ← a_{i+size-1-shift..i} || 0^{shift}
            endfor
        G.SHR.I:
            for i ← 0 to 128-size by size
                b_{i+size-1..i} ← a_{i+size-1}^{shift} || a_{i+size-1..i+shift}
            endfor
        G.U.SHR.I:
            for i ← 0 to 128-size by size
                b_{i+size-1..i} ← 0^{shift} || a_{i+size-1..i+shift}
            endfor
    endcase
    case op of
        G.EXPAND.I, G.U.EXPAND.I, G.SHL.I, G.SHR.I, G.U.SHR.I:
            RegWrite(rb, 128, b)
        G.COMPRESS.I:
            RegWrite(rb, 64, b)
    endcase
enddef
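
The lane-wise behaviour of the shift and compress forms can be illustrated with a short Python sketch. This is only a model under stated assumptions, not the patent's implementation: registers are plain Python integers, the helper names (lanes, join, check_shift, g_shl_i, g_u_shr_i, g_shr_i, g_compress_i) are hypothetical, the shift forms follow the Description in treating a shift equal to the size as reserved, and the compress form follows one reading of the printed definition.

```python
def lanes(value, size, width):
    """Split a width-bit integer into size-bit lanes, lowest lane first."""
    mask = (1 << size) - 1
    return [(value >> i) & mask for i in range(0, width, size)]


def join(parts, size):
    """Concatenate lanes (lowest first), truncating each to size bits."""
    mask = (1 << size) - 1
    out = 0
    for n, part in enumerate(parts):
        out |= (part & mask) << (n * size)
    return out


def check_shift(shift, size):
    # Assumption: follow the Description, where shift >= size is reserved.
    if not 0 <= shift < size:
        raise ValueError("reserved instruction")


def g_shl_i(a, size, shift):
    """Shift every size-bit lane of a 128-bit value left, filling with zeros."""
    check_shift(shift, size)
    return join([lane << shift for lane in lanes(a, size, 128)], size)


def g_u_shr_i(a, size, shift):
    """Shift every lane right, filling with zeros (unsigned)."""
    check_shift(shift, size)
    return join([lane >> shift for lane in lanes(a, size, 128)], size)


def g_shr_i(a, size, shift):
    """Shift every lane right, replicating the lane's sign bit (signed)."""
    check_shift(shift, size)
    out = []
    for lane in lanes(a, size, 128):
        signed = lane - (1 << size) if lane >> (size - 1) else lane
        out.append(signed >> shift)  # >> on a negative int is arithmetic
    return join(out, size)


def g_compress_i(a, size, simm):
    """Take size bits at offset simm from each 2*size-bit lane (64-bit result)."""
    # Assumption: follow the printed test, where only simm > size is reserved.
    if not 0 <= simm <= size:
        raise ValueError("reserved instruction")
    return join([lane >> simm for lane in lanes(a, 2 * size, 128)], size)
```

For example, g_shr_i(0x80FF, 8, 4) treats 0xFF and 0x80 as the signed bytes -1 and -128 and shifts each independently, so neighbouring lanes never exchange bits.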

Exceptions
Reserved instruction

Group Short Immediate Inplace

These operations take operands from two register pairs, perform operations on
groups of bits in the operands, and place the concatenated results in the second
register pair.



Operation codes

G.MSHR.I.2      Group merge shift right immediate pecks
G.MSHR.I.4      Group merge shift right immediate nibbles
G.MSHR.I.8      Group merge shift right immediate bytes
G.MSHR.I.16     Group merge shift right immediate doublets
G.MSHR.I.32     Group merge shift right immediate quadlets
G.MSHR.I.64     Group merge shift right immediate octlets
G.MSHR.I.128    Group merge shift right immediate hexlets

Format

G.op.size rb=ra,simm

 31     24 23   18 17   12 11    6 5     0
|  G.size |   ra  |   rb  |  simm |  op   |
     8        6       6       6       6

Description 

Two 128-bit values are taken from the contents of the register pairs specified by ra 
and rb. A third operand is taken from simm. The specified operation is performed, 
and the result is placed in the register pair specified by rb. 

This instruction is undefined and causes a reserved instruction exception if the 
simm field is greater or equal to the size specified. 

Definition 

def GroupShortImmediateInplace(op,size,ra,rb,simm)
    a ← RegRead(ra, 128)
    b ← RegRead(rb, 128)
    shift ← op_0 || simm
    if shift>size then
        raise ReservedInstruction
    endif
    endcase
    case op of
        G.MSHR.I:
            for i ← 0 to 128-size by size
                c_{i+size-1..i} ← b_{i+size-1..i+size-1-shift} || a_{i+size-1..i+shift}
            endfor
    endcase
    RegWrite(rb, 128, c)
enddef

Group Shuffle Immediate

These operations take operands from a pair of registers, perform operations on
groups of bits in the operands, and place the concatenated results in a register or
pair of registers.



Operation codes

G.SHUFFLE.I         Group shuffle immediate
G.SHUFFLE.I.4MUX    Group shuffle immediate and 4-way multiplex

Format

op.a.b.c rc=ra,rb

 31     24 23   18 17   12 11    6 5     0
|   op    |   ra  |   rb  |   rc  | simm  |
     8        6       6       6       6

Description

A 128-bit value is taken from the contents of the register pair specified by ra. The
second operand is taken from simm. The specified operation is performed, and the
result is placed in the register pair specified by rc.

This instruction is undefined and causes a reserved instruction exception if the
simm field is greater or equal to the size specified.

Definition 

def GroupShuffleImmediate(op,ra,rb,rc,simm)
    case op of
        G.SHUFFLE.I:
            a ← RegRead(ra, 64)
            b ← RegRead(rb, 64)
        G.SHUFFLE.I.4MUX:
            a ← RegRead(ra, 128)
            b ← RegRead(rb, 128)
    endcase
    if simm≥size then
        raise ReservedInstruction
    endif
    case op of
        G.SHUFFLE.I:
            ab ← a || b
            case simm of
                0:
                    c ← ab
                1..56:
                    for x ← 0 to 7: for y ← 0 to x-1: for z ← 1 to x-y
                        if simm = ((x*x*x-3*x*x-4*x)/6-(z*z-z)/2+x*z+y+1) then
                            for i ← 0 to 127
                            end
                        endif
                    endfor: endfor: endfor
                57..255:
                    raise ReservedInstruction
            endcase
        G.SHUFFLE.I.4MUX:
            case simm of
                0:
                1..56:
                    for x ← 0 to 7: for y ← 0 to x-1: for z ← 1 to x-y
                        if simm = ((x*x*x-3*x*x-4*x)/6-(z*z-z)/2+x*z+y+1) then
                            for i ← 0 to 127
                            end
                        endif
                    endfor: endfor: endfor
                57..255:
                    raise ReservedInstruction
            endcase
            for i ← 0 to 127
            endfor
    endcase
    RegWrite(rc, 128, c)
enddef

Exceptions
Reserved instruction

Group Swizzle Immediate

These operations perform calculations with a general register or register pair
value and two immediate values, placing the result in a general register pair.



Operation codes

G.SWIZZLE.I         Group swizzle immediate
G.SWIZZLE.I.COPY    Group swizzle immediate with copy
G.SWIZZLE.I.SWAP    Group swizzle immediate with swap

Format

op rb=ra,icopy,iswap

 31     24 23   18 17   12 11    6 5     0
|   op    |   ra  |   rb  | icopya| iswapa|
     8        6       6       6       6

Description

The contents of register ra, or if specified, the contents of the register pair
specified by ra, are fetched, and 6-bit immediate values are taken from the 6-bit
icopya and iswapa fields. The specified operation is performed on these operands.
The result is placed into the register pair specified by rb.

Definition

def GroupSwizzle(op,ra,rb,icopya,iswapa) as
    case op of
        G.SWIZZLE.I:
            a ← RegRead(ra, 128)
            icopy ← 0 || icopya
            iswap ← 0 || iswapa
        G.SWIZZLE.I.COPY:
            a ← 0^{64} || RegRead(ra, 64)
            icopy ← 1 || icopya
            iswap ← 0 || iswapa
        G.SWIZZLE.I.SWAP:
            a ← RegRead(ra, 128)
            icopy ← 1 || icopya
            iswap ← 1 || iswapa
    endcase
    for i ← 0 to 127
        b_i ← a_{(i & icopy) ^ iswap}
    endfor
    RegWrite(rb, 128, b)
enddef
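
The swizzle rule above is a pure bit permutation: result bit i is copied from source bit (i AND icopy) XOR iswap. A minimal Python sketch, assuming registers are modelled as integers and using hypothetical function names that are not part of the specification, shows how the three opcodes differ only in the bit prepended to each 6-bit immediate:

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def g_swizzle(a, icopy, iswap):
    """Core rule: result bit i comes from source bit (i & icopy) ^ iswap."""
    b = 0
    for i in range(128):
        src = (i & icopy) ^ iswap
        b |= ((a >> src) & 1) << i
    return b


def g_swizzle_i(a, icopya, iswapa):
    # icopy = 0 || icopya, iswap = 0 || iswapa
    return g_swizzle(a & MASK128, icopya & 0x3F, iswapa & 0x3F)


def g_swizzle_i_copy(a, icopya, iswapa):
    # Source is a zero-extended 64-bit register; icopy = 1 || icopya
    return g_swizzle(a & MASK64, 0x40 | (icopya & 0x3F), iswapa & 0x3F)


def g_swizzle_i_swap(a, icopya, iswapa):
    # icopy = 1 || icopya, iswap = 1 || iswapa
    return g_swizzle(a & MASK128, 0x40 | (icopya & 0x3F), 0x40 | (iswapa & 0x3F))
```

With icopya = 0x3F and iswapa = 0, g_swizzle_i_swap exchanges the two 64-bit halves of the register pair, because the prepended 1 bit of iswap flips bit 6 of every index.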

Exceptions
Reserved instruction

Group Ternary

These operations perform calculations with three general register values, placing
the result in a fourth general register.



Operation codes

G.8MUX               Group 8-way multiplex
G.EXTRACT.128        Group extract hexlet
G.MULADD.1 (25)      Group signed multiply bits and add pecks
G.MULADD.2           Group signed multiply pecks and add nibbles
G.MULADD.4           Group signed multiply nibbles and add bytes
G.MULADD.8           Group signed multiply bytes and add doublets
G.MULADD.16          Group signed multiply doublets and add quadlets
G.MULADD.32          Group signed multiply quadlets and add octlets
G.MULADD.64          Group signed multiply octlets and add hexlets
G.MUX                Group multiplex
G.SELECT.8           Group select bytes
G.TRANSPOSE.8MUX     Group transpose and 8-way multiplex
G.U.MULADD.2         Group unsigned multiply pecks and add nibbles
G.U.MULADD.4         Group unsigned multiply nibbles and add bytes
G.U.MULADD.8         Group unsigned multiply bytes and add doublets
G.U.MULADD.16        Group unsigned multiply doublets and add quadlets
G.U.MULADD.32        Group unsigned multiply quadlets and add octlets
G.U.MULADD.64        Group unsigned multiply octlets and add hexlets



class                        op                      size
extract                      EXTRACT                 128
signed multiply and add      MULADD                  1 2 4 8 16 32 64
unsigned multiply and add    U.MULADD                2 4 8 16 32 64
multiplex                    8MUX TRANSPOSE.8MUX     NONE
select                       SELECT                  8

Format

G.op.size rd=ra,rb,rc

 31     24 23   18 17   12 11    6 5     0
| op.size |   ra  |   rb  |   rc  |  rd   |
     8        6       6       6       6

(25) G.MULADD.1 is used as the encoding for G.U.MULADD.1.
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Description 

The contents of registers or register pairs specified by ra, rb, and rc are fetched.
The specified operation is performed on these operands. The result is placed into
the register pair specified by rd.

Definition 

def GroupTernary(op,size,ra,rb,rc,rd) as
    case op of
        G.MUX:
            a ← RegRead(ra, 128)
            b ← RegRead(rb, 128)
            c ← RegRead(rc, 128)
        G.EXTRACT, G.8MUX, G.TRANSPOSE.8MUX:
            a ← RegRead(ra, 128)
            b ← RegRead(rb, 128)
            c ← RegRead(rc, 64)
        G.MULADD, G.U.MULADD:
            a ← RegRead(ra, 64)
            b ← RegRead(rb, 64)
            c ← RegRead(rc, 128)
        G.SELECT:
            a ← RegRead(ra, 64)
            b ← RegRead(rb, 64)
            c ← RegRead(rc, 64)
    endcase
    case op of
        G.MUX:
            d ← (b and a) or (c andnot a)
        G.8MUX:
            for i ← 0 to 127
            endfor
        G.TRANSPOSE.8MUX:
            for i ← 0 to 127
            endfor
            for i ← 0 to 127
            endfor
        G.EXTRACT:
            d ← (a || b)_{(c&127)+127..(c&127)}
        G.MULADD:
            for i ← 0 to 64-size by size
                d_{2*(i+size)-1..2*i} ← c_{2*(i+size)-1..2*i} +
                    (a_{size-1+i}^{size} || a_{size-1+i..i}) * (b_{size-1+i}^{size} || b_{size-1+i..i})
            endfor
        G.U.MULADD:
            for i ← 0 to 64-size by size
                d_{2*(i+size)-1..2*i} ← c_{2*(i+size)-1..2*i} +
                    (0^{size} || a_{size-1+i..i}) * (0^{size} || b_{size-1+i..i})
            endfor
        G.SELECT:
            ab ← a || b
            for i ← 0 to 15
                j ← c_{4*i+3..4*i}
            endfor
    endcase
    RegWrite(rd, 128, d)
enddef
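
Two of the ternary operations are easy to model directly. The following Python sketch, with hypothetical function names and registers represented as integers (both assumptions made for illustration), shows the bitwise multiplex and the widening multiply-and-add, in which each size-bit product is accumulated into a lane twice as wide:

```python
MASK128 = (1 << 128) - 1


def g_mux(a, b, c):
    """Bitwise select: where a has a 1 take the bit from b, otherwise from c."""
    return ((b & a) | (c & ~a)) & MASK128


def sign_extend(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def g_muladd(a, b, c, size, signed=True):
    """Multiply size-bit lanes of 64-bit a and b; add into 2*size-bit lanes of c."""
    wide_mask = (1 << (2 * size)) - 1
    d = 0
    for i in range(0, 64, size):
        x = (a >> i) & ((1 << size) - 1)
        y = (b >> i) & ((1 << size) - 1)
        if signed:
            x, y = sign_extend(x, size), sign_extend(y, size)
        acc = (c >> (2 * i)) & wide_mask
        # The sum wraps modulo 2**(2*size), as a fixed-width lane would.
        d |= ((acc + x * y) & wide_mask) << (2 * i)
    return d
```

Passing signed=False gives the G.U.MULADD behaviour, where operands are zero-extended rather than sign-extended before the multiply.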

Exceptions
Reserved instruction

Group Floating-point

These operations take two values from registers, perform floating-point arithmetic
on groups of bits in the operands, and place the concatenated results in a register.



Operation codes

GF.ADD.16       Group floating-point add half
GF.ADD.16.C     Group floating-point add half ceiling
GF.ADD.16.F     Group floating-point add half floor
GF.ADD.16.N     Group floating-point add half nearest
GF.ADD.16.T     Group floating-point add half truncate
GF.ADD.16.X     Group floating-point add half exact
GF.ADD.32       Group floating-point add single
GF.ADD.32.C     Group floating-point add single ceiling
GF.ADD.32.F     Group floating-point add single floor
GF.ADD.32.N     Group floating-point add single nearest
GF.ADD.32.T     Group floating-point add single truncate
GF.ADD.32.X     Group floating-point add single exact
GF.ADD.64       Group floating-point add double
GF.ADD.64.C     Group floating-point add double ceiling
GF.ADD.64.F     Group floating-point add double floor
GF.ADD.64.N     Group floating-point add double nearest
GF.ADD.64.T     Group floating-point add double truncate
GF.ADD.64.X     Group floating-point add double exact
GF.DIV.16       Group floating-point divide half
GF.DIV.16.C     Group floating-point divide half ceiling
GF.DIV.16.F     Group floating-point divide half floor
GF.DIV.16.N     Group floating-point divide half nearest
GF.DIV.16.T     Group floating-point divide half truncate
GF.DIV.16.X     Group floating-point divide half exact
GF.DIV.32       Group floating-point divide single
GF.DIV.32.C     Group floating-point divide single ceiling
GF.DIV.32.F     Group floating-point divide single floor
GF.DIV.32.N     Group floating-point divide single nearest
GF.DIV.32.T     Group floating-point divide single truncate
GF.DIV.32.X     Group floating-point divide single exact
GF.DIV.64       Group floating-point divide double
GF.DIV.64.C     Group floating-point divide double ceiling
GF.DIV.64.F     Group floating-point divide double floor
GF.DIV.64.N     Group floating-point divide double nearest
GF.DIV.64.T     Group floating-point divide double truncate
GF.DIV.64.X     Group floating-point divide double exact
GF.MUL.16       Group floating-point multiply half
GF.MUL.16.C     Group floating-point multiply half ceiling
GF.MUL.16.F     Group floating-point multiply half floor
GF.MUL.16.N     Group floating-point multiply half nearest
GF.MUL.16.T     Group floating-point multiply half truncate
GF.MUL.16.X     Group floating-point multiply half exact
GF.MUL.32       Group floating-point multiply single
GF.MUL.32.C     Group floating-point multiply single ceiling
GF.MUL.32.F     Group floating-point multiply single floor
GF.MUL.32.N     Group floating-point multiply single nearest
GF.MUL.32.T     Group floating-point multiply single truncate
GF.MUL.32.X     Group floating-point multiply single exact
GF.MUL.64       Group floating-point multiply double
GF.MUL.64.C     Group floating-point multiply double ceiling
GF.MUL.64.F     Group floating-point multiply double floor
GF.MUL.64.N     Group floating-point multiply double nearest
GF.MUL.64.T     Group floating-point multiply double truncate
GF.MUL.64.X     Group floating-point multiply double exact





            op     prec            round/trap
add         ADD    16 32 64 128    NONE C F N T X
divide      DIV    16 32 64 128    NONE C F N T X
multiply    MUL    16 32 64 128    NONE C F N T X

Format

GF.op.prec.round rc=ra,rb

 31     24 23   18 17   12 11    6 5        0
| GF.prec |   ra  |   rb  |   rc  | op.round |
     8        6       6       6        6

Description 

The contents of registers ra and rb are combined using the specified floating-point
operation. The result is placed in register rc. The operation is rounded using the
specified rounding option or using round-to-nearest if not specified. If a rounding
option is specified, the operation raises a floating-point exception if a floating-point
invalid operation, divide by zero, overflow, or underflow occurs, or when specified,
if the result is inexact. If a rounding option is not specified, floating-point
exceptions are not raised, and are handled according to the default rules of IEEE
754.

Definition 

def GroupFloatingPoint(op,prec,round,ra,rb,rc) as
    a ← RegRead(ra, 128)
    b ← RegRead(rb, 128)
    for i ← 0 to 128-prec by prec
        ai ← F(prec, a_{i+prec-1..i})
        bi ← F(prec, b_{i+prec-1..i})
        if round≠NONE then
            if isSignallingNaN(ai) | isSignallingNaN(bi)
                raise FloatingPointException
            endif
            case op of
                GF.DIV:
                    if bi=0 then
                        raise FloatingPointArithmetic
                    endif
                others:
            endcase
        endif
        case op of
            GF.ADD:
                ci ← ai+bi
            GF.MUL:
                ci ← ai*bi
            GF.DIV:
                ci ← ai/bi
        endcase
        case op of
            GF.ADD, GF.MUL, GF.DIV:
                c_{i+prec-1..i} ← PackF(prec, ci)
        endcase
    endfor
    endcase
    case round of
        X:
        N:
        T:
        F:
        C:
        NONE:
    endcase
    if rc_0 then
        raise ReservedInstruction
    endif
    RegWrite(rc, 128, c)
    endcase
enddef
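
The unpack, operate, repack pattern in this definition can be sketched in Python for the single-precision (prec = 32) case. This is an illustrative model only: the function names are hypothetical, registers are modelled as 128-bit integers, only the default round-to-nearest behaviour is shown, and the trapping forms and directed rounding modes are not modelled.

```python
import struct


def unpack_lanes(reg):
    """128-bit integer -> four single-precision values, lowest lane first."""
    return struct.unpack("<4f", reg.to_bytes(16, "little"))


def pack_lanes(values):
    """Four Python floats -> 128-bit integer, rounding each to single precision."""
    return int.from_bytes(struct.pack("<4f", *values), "little")


OPS = {
    "ADD": lambda x, y: x + y,
    "MUL": lambda x, y: x * y,
    # Python raises ZeroDivisionError on x / 0.0; the non-trapping hardware
    # forms would instead return an IEEE 754 infinity or NaN.
    "DIV": lambda x, y: x / y,
}


def gf_binary_32(op, a, b):
    """Apply op lane by lane to two 128-bit registers of four singles each."""
    fn = OPS[op]
    return pack_lanes([fn(x, y) for x, y in zip(unpack_lanes(a), unpack_lanes(b))])
```

A usage example: with a = pack_lanes([1.0, 2.0, 3.0, 4.0]) and b = pack_lanes([0.5] * 4), unpack_lanes(gf_binary_32("ADD", a, b)) returns the four lane sums.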

Exceptions

Reserved instruction
Floating-point arithmetic
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These operations take two values from registers, perform floating-point arithmetic 
on groups of bits in the operands, and place the concatenated results in a register. 



Operation codes

GF.SET.E.16        Group floating-point set equal half
GF.SET.E.16.X      Group floating-point set equal half exact
GF.SET.E.32        Group floating-point set equal single
GF.SET.E.32.X      Group floating-point set equal single exact
GF.SET.E.64        Group floating-point set equal double
GF.SET.E.64.X      Group floating-point set equal double exact
GF.SET.GE.16.X     Group floating-point set greater or equal half exact
GF.SET.GE.32.X     Group floating-point set greater or equal single exact
GF.SET.GE.64.X     Group floating-point set greater or equal double exact
GF.SET.L.16.X      Group floating-point set less half exact
GF.SET.L.32.X      Group floating-point set less single exact
GF.SET.L.64.X      Group floating-point set less double exact
GF.SET.NE.16       Group floating-point set not equal half
GF.SET.NE.16.X     Group floating-point set not equal half exact
GF.SET.NE.32       Group floating-point set not equal single
GF.SET.NE.32.X     Group floating-point set not equal single exact
GF.SET.NE.64       Group floating-point set not equal double
GF.SET.NE.64.X     Group floating-point set not equal double exact
GF.SET.NGE.16.X    Group floating-point set not greater or equal half exact
GF.SET.NGE.32.X    Group floating-point set not greater or equal single exact
GF.SET.NGE.64.X    Group floating-point set not greater or equal double exact
GF.SET.NL.16.X     Group floating-point set not less half exact
GF.SET.NL.32.X     Group floating-point set not less single exact
GF.SET.NL.64.X     Group floating-point set not less double exact
GF.SET.NUE.16      Group floating-point set not unordered or equal half
GF.SET.NUE.16.X    Group floating-point set not unordered or equal half exact
GF.SET.NUE.32      Group floating-point set not unordered or equal single
GF.SET.NUE.32.X    Group floating-point set not unordered or equal single exact
GF.SET.NUE.64      Group floating-point set not unordered or equal double
GF.SET.NUE.64.X    Group floating-point set not unordered or equal double exact
GF.SET.NUGE.16     Group floating-point set not unordered greater or equal half
GF.SET.NUGE.32     Group floating-point set not unordered greater or equal single
GF.SET.NUGE.64     Group floating-point set not unordered greater or equal double
GF.SET.NUL.16      Group floating-point set not unordered or less half
GF.SET.NUL.32      Group floating-point set not unordered or less single
GF.SET.NUL.64      Group floating-point set not unordered or less double
GF.SET.UE.16       Group floating-point set unordered or equal half
GF.SET.UE.16.X     Group floating-point set unordered or equal half exact
GF.SET.UE.32       Group floating-point set unordered or equal single
GF.SET.UE.32.X     Group floating-point set unordered or equal single exact
GF.SET.UE.64       Group floating-point set unordered or equal double
GF.SET.UE.64.X     Group floating-point set unordered or equal double exact
GF.SET.UGE.16      Group floating-point set unordered greater or equal half
GF.SET.UGE.32      Group floating-point set unordered greater or equal single
GF.SET.UGE.64      Group floating-point set unordered greater or equal double
GF.SET.UL.16       Group floating-point set unordered or less half
GF.SET.UL.32       Group floating-point set unordered or less single
GF.SET.UL.64       Group floating-point set unordered or less double
GF.SUB.16          Group floating-point subtract half
GF.SUB.16.C        Group floating-point subtract half ceiling
GF.SUB.16.F        Group floating-point subtract half floor
GF.SUB.16.N        Group floating-point subtract half nearest
GF.SUB.16.T        Group floating-point subtract half truncate
GF.SUB.16.X        Group floating-point subtract half exact
GF.SUB.32          Group floating-point subtract single
GF.SUB.32.C        Group floating-point subtract single ceiling
GF.SUB.32.F        Group floating-point subtract single floor
GF.SUB.32.N        Group floating-point subtract single nearest
GF.SUB.32.T        Group floating-point subtract single truncate
GF.SUB.32.X        Group floating-point subtract single exact
GF.SUB.64          Group floating-point subtract double
GF.SUB.64.C        Group floating-point subtract double ceiling
GF.SUB.64.F        Group floating-point subtract double floor
GF.SUB.64.N        Group floating-point subtract double nearest
GF.SUB.64.T        Group floating-point subtract double truncate
GF.SUB.64.X        Group floating-point subtract double exact





            op                           prec        round/trap
set         SET. E NE UE NUE             16 32 64    NONE X
            SET. NUGE NUL UGE UL         16 32 64    NONE
            SET. L GE NL NGE             16 32 64    X
subtract    SUB                          16 32 64    NONE C F N T X

Format

GF.op.prec.round rc=rb,ra

 31     24 23   18 17   12 11    6 5        0
| GF.prec |   ra  |   rb  |   rc  | op.round |
     8        6       6       6        6



Description 

The contents of registers ra and rb are combined using the specified floating-point 
operation. The result is placed in register rc. The operation is rounded using the
specified rounding option or using round-to-nearest if not specified. If a rounding 
option is specified, the operation raises a floating-point exception if a floating-point 
invalid operation, divide by zero, overflow, or underflow occurs, or when specified, 
if the result is inexact. If a rounding option is not specified, floating-point 
exceptions are not raised, and are handled according to the default rules of IEEE 
754. 



Definition 

def GroupFloatingPointReversed(op,prec,round,ra,rb,rc) as
    a ← RegRead(ra, 128)
    b ← RegRead(rb, 128)
    for i ← 0 to 128-prec by prec
        ai ← F(prec, a_{i+prec-1..i})
        bi ← F(prec, b_{i+prec-1..i})
        if round≠NONE then
            if isSignallingNaN(ai) | isSignallingNaN(bi)
                raise FloatingPointException
            endif
            case op of
                GF.SET.L, GF.SET.GE, GF.SET.NL, GF.SET.NGE:
                    if isNaN(ai) | isNaN(bi) then
                        raise FloatingPointArithmetic
                    endif
            endcase
        endif
        case op of
            GF.SUB:
                ci ← bi-ai
            GF.SET.NUGE, GF.SET.L:
                ci ← bi !?≥ ai
            GF.SET.NUL, GF.SET.GE:
                ci ← bi !?< ai
            GF.SET.UGE, GF.SET.NL:
                ci ← bi ?≥ ai
            GF.SET.UL, GF.SET.NGE:
                ci ← bi ?< ai
            GF.SET.UE:
                ci ← bi ?= ai
            GF.SET.NUE:
                ci ← bi !?= ai
            GF.SET.E:
                ci ← bi = ai
            GF.SET.NE:
                ci ← bi ≠ ai
        endcase
        case op of
            GF.SUB:
                c_{i+prec-1..i} ← PackF(prec, ci)
            GF.SET.NUGE, GF.SET.NUL, GF.SET.UGE, GF.SET.UL,
            GF.SET.L, GF.SET.GE, GF.SET.E, GF.SET.NE, GF.SET.UE, GF.SET.NUE:
        endcase
    endfor
    endcase
    case round of
        X:
        N:
        T:
        F:
        C:
        NONE:
    endcase
    RegWrite(rc, 128, c)
    endcase
enddef
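
The set operations differ mainly in how they treat unordered operands, that is, comparisons in which at least one operand is a NaN. The Python sketch below models the comparison predicates on scalar values; the table name, the function names, and the trap argument are hypothetical, and the operand order (b compared against a) follows the reversed form rc=rb,ra used in this section. The mapping of each opcode to its predicate is read from the opcode names, which is an assumption where the printed definition is illegible.

```python
import math


def unordered(x, y):
    """IEEE 754 'unordered': true when either operand is a NaN."""
    return math.isnan(x) or math.isnan(y)


PREDICATES = {
    "E":    lambda b, a: b == a,
    "NE":   lambda b, a: b != a,
    "UE":   lambda b, a: unordered(b, a) or b == a,
    "NUE":  lambda b, a: not (unordered(b, a) or b == a),
    "UL":   lambda b, a: unordered(b, a) or b < a,
    "NUL":  lambda b, a: not (unordered(b, a) or b < a),
    "UGE":  lambda b, a: unordered(b, a) or b >= a,
    "NUGE": lambda b, a: not (unordered(b, a) or b >= a),
}

# The L, GE, NL and NGE forms share a predicate with an unordered-aware form
# but exist only as trapping (.X) opcodes that reject NaN operands.
TRAPPING_ALIASES = {"L": "NUGE", "GE": "NUL", "NL": "UGE", "NGE": "UL"}


def gf_set(op, b, a):
    """Evaluate one lane of a GF.SET comparison; returns True or False."""
    if op in TRAPPING_ALIASES:
        if unordered(b, a):
            raise ArithmeticError("floating-point arithmetic exception")
        op = TRAPPING_ALIASES[op]
    return PREDICATES[op](b, a)
```

For ordered operands, gf_set("L", b, a) and gf_set("NUGE", b, a) agree; they differ only when a NaN is present, where the first raises and the second returns False.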

Exceptions

Reserved instruction
Floating-point arithmetic

Group Floating-point Ternary

These operations perform floating-point arithmetic on three groups of floating-
point operands contained in registers.



Operation codes

GF.MULADD.16    Group floating-point multiply and add half
GF.MULSUB.16    Group floating-point multiply and subtract half
GF.MULADD.32    Group floating-point multiply and add single
GF.MULSUB.32    Group floating-point multiply and subtract single
GF.MULADD.64    Group floating-point multiply and add double
GF.MULSUB.64    Group floating-point multiply and subtract double

                         op        prec
multiply and add         MULADD    16 32 64
multiply and subtract    MULSUB    16 32 64


Format

GF.operation.type rd=ra,rb,rc

 31     24 23   18 17   12 11    6 5     0
| op.prec |   ra  |   rb  |   rc  |  rd   |
     8        6       6       6       6


Description

The contents of registers ra and rb are taken to represent a group of floating-point
operands and pairwise are multiplied together and added to or subtracted from
the group of floating-point operands taken from the contents of register rc. The
results are concatenated and placed in register rd. The results are rounded to the
nearest representable floating-point value in a single floating-point operation.
Floating-point exceptions are not raised, and are handled according to the default
rules of IEEE 754. These instructions cannot select a directed rounding mode or
trap on inexact.

Definition 

def GroupFloatingPointTernary(op,prec,ra,rb,rc,rd) as
    a ← RegRead(ra, 128)
    b ← RegRead(rb, 128)
    c ← RegRead(rc, 128)
    for i ← 0 to 128-prec by prec
        ai ← F(prec, a_{i+prec-1..i})
        bi ← F(prec, b_{i+prec-1..i})
        ci ← F(prec, c_{i+prec-1..i})
        case op of
            GF.MULADD:
                di ← (ai * bi) + ci
            GF.MULSUB:
                di ← (ai * bi) - ci
        endcase
        d_{i+prec-1..i} ← PackF(prec, di)
    endfor
    RegWrite(rd, 128, d)
enddef
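
The phrase "rounded ... in a single floating-point operation" means the product is not rounded before the addition. A small Python sketch can show why this matters for one double-precision lane. It uses exact rational arithmetic from the standard library and rounds once at the end; the function names are hypothetical, and the model handles finite inputs only (infinities, NaNs, and signed zeros are not modelled).

```python
from fractions import Fraction


def fused_muladd(a, b, c):
    """(a * b) + c computed exactly, then rounded once to double precision."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def fused_mulsub(a, b, c):
    """(a * b) - c computed exactly, then rounded once to double precision."""
    return float(Fraction(a) * Fraction(b) - Fraction(c))


# With separate rounding, the product 1 - 2**-60 rounds to 1.0 and the
# small term is lost; the single-rounding form keeps it.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
separately_rounded = a * b + (-1.0)
single_rounded = fused_muladd(a, b, -1.0)
```

Here separately_rounded is 0.0 while single_rounded is -2**-60, which is the exact value of the expression.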

Exceptions

Reserved instruction
Floating-point arithmetic

Group Floating-point Unary

These operations take one value from a register, perform floating-point arithmetic
on groups of bits in the operands, and place the concatenated results in a register.



Operation codes

GF.ABS.16          Group floating-point absolute value half
GF.ABS.16.X        Group floating-point absolute value half exact
GF.ABS.32          Group floating-point absolute value single
GF.ABS.32.X        Group floating-point absolute value single exact
GF.ABS.64          Group floating-point absolute value double
GF.ABS.64.X        Group floating-point absolute value double exact
GF.DEFLATE.32      Group floating-point convert half from single
GF.DEFLATE.32.C    Group floating-point convert half from single ceiling
GF.DEFLATE.32.F    Group floating-point convert half from single floor
GF.DEFLATE.32.N    Group floating-point convert half from single nearest
GF.DEFLATE.32.T    Group floating-point convert half from single truncate
GF.DEFLATE.32.X    Group floating-point convert half from single exact
GF.DEFLATE.64      Group floating-point convert single from double
GF.DEFLATE.64.C    Group floating-point convert single from double ceiling
GF.DEFLATE.64.F    Group floating-point convert single from double floor
GF.DEFLATE.64.N    Group floating-point convert single from double nearest
GF.DEFLATE.64.T    Group floating-point convert single from double truncate
GF.DEFLATE.64.X    Group floating-point convert single from double exact
GF.FLOAT.16        Group floating-point convert half from integer
GF.FLOAT.16.C      Group floating-point convert half from integer ceiling
GF.FLOAT.16.F      Group floating-point convert half from integer floor
GF.FLOAT.16.N      Group floating-point convert half from integer nearest
GF.FLOAT.16.T      Group floating-point convert half from integer truncate
GF.FLOAT.16.X      Group floating-point convert half from integer exact
GF.FLOAT.32        Group floating-point convert single from integer
GF.FLOAT.32.C      Group floating-point convert single from integer ceiling
GF.FLOAT.32.F      Group floating-point convert single from integer floor
GF.FLOAT.32.N      Group floating-point convert single from integer nearest
GF.FLOAT.32.T      Group floating-point convert single from integer truncate
GF.FLOAT.32.X      Group floating-point convert single from integer exact
GF.FLOAT.64        Group floating-point convert double from integer
GF.FLOAT.64.C      Group floating-point convert double from integer ceiling
GF.FLOAT.64.F      Group floating-point convert double from integer floor
GF.FLOAT.64.N      Group floating-point convert double from integer nearest
GF.FLOAT.64.T      Group floating-point convert double from integer truncate
GF.FLOAT.64.X      Group floating-point convert double from integer exact
GF.INFLATE.16      Group floating-point convert single from half
GF.INFLATE.16.X    Group floating-point convert single from half exact
GF.INFLATE.32      Group floating-point convert double from single
GF.INFLATE.32.X    Group floating-point convert double from single exact
GF.NEG.16          Group floating-point negate half
GF.NEG.16.X        Group floating-point negate half exact
GF.NEG.32          Group floating-point negate single
GF.NEG.32.X        Group floating-point negate single exact
GF.NEG.64          Group floating-point negate double
GF.NEG.64.X        Group floating-point negate double exact
GF.SINK.16         Group floating-point convert integer from half
GF.SINK.16.C       Group floating-point convert integer from half ceiling
GF.SINK.16.F       Group floating-point convert integer from half floor
GF.SINK.16.N       Group floating-point convert integer from half nearest
GF.SINK.16.T       Group floating-point convert integer from half truncate
GF.SINK.16.X       Group floating-point convert integer from half exact
GF.SINK.32         Group floating-point convert integer from single
GF.SINK.32.C       Group floating-point convert integer from single ceiling
GF.SINK.32.F       Group floating-point convert integer from single floor
GF.SINK.32.N       Group floating-point convert integer from single nearest
GF.SINK.32.T       Group floating-point convert integer from single truncate
GF.SINK.32.X       Group floating-point convert integer from single exact
GF.SINK.64         Group floating-point convert integer from double
GF.SINK.64.C       Group floating-point convert integer from double ceiling
GF.SINK.64.F       Group floating-point convert integer from double floor
GF.SINK.64.N       Group floating-point convert integer from double nearest
GF.SINK.64.T       Group floating-point convert integer from double truncate
GF.SINK.64.X       Group floating-point convert integer from double exact
GF.SQR.16          Group floating-point square root half
GF.SQR.16.C        Group floating-point square root half ceiling
GF.SQR.16.F        Group floating-point square root half floor
GF.SQR.16.N        Group floating-point square root half nearest
GF.SQR.16.T        Group floating-point square root half truncate
GF.SQR.16.X        Group floating-point square root half exact
GF.SQR.32          Group floating-point square root single
GF.SQR.32.C        Group floating-point square root single ceiling
GF.SQR.32.F        Group floating-point square root single floor
GF.SQR.32.N        Group floating-point square root single nearest
GF.SQR.32.T        Group floating-point square root single truncate
GF.SQR.32.X        Group floating-point square root single exact
GF.SQR.64          Group floating-point square root double
GF.SQR.64.C        Group floating-point square root double ceiling
GF.SQR.64.F        Group floating-point square root double floor
GF.SQR.64.N        Group floating-point square root double nearest
GF.SQR.64.T        Group floating-point square root double truncate
GF.SQR.64.X        Group floating-point square root double exact








                              op        prec        round/trap
absolute value                ABS       16 32 64    NONE X
float from integer            FLOAT     16 32 64    NONE C F N T X
integer from float            SINK      16 32 64    NONE C F N T X
increase format precision     INFLATE   16 32       NONE X
decrease format precision     DEFLATE   32 64       NONE C F N T X
square root                   SQR       16 32 64    NONE C F N T X



Format

GF.op.prec.round rc=ra

 31        24 23     18 17     12 11      6 5           0
|   GF.prec  |    ra   |    op   |    rc   | UNARY.round |
      8           6         6         6          6



Description 

The contents of register ra are used as the operand of the specified floating-point
operation. The result is placed in register rc. The operation is rounded using the
specified rounding option or using round-to-nearest if not specified. If a rounding 
option is specified, the operation raises a floating-point exception if a floating-point 
invalid operation, divide by zero, overflow, or underflow occurs, or when specified, 
if the result is inexact. If a rounding option is not specified, floating-point 
exceptions are not raised, and are handled according to the default rules of IEEE 
754. 

Definition 

def GroupFloatingPointUnary(op,prec,round,ra,rb,rc) as
    a <- RegRead(ra, 128)
    case op of
        GF.ABS, GF.NEG, GF.SQR:
            for i <- 0 to 128-prec by prec
                ai <- F(prec, a[i+prec-1..i])
                case op of
                    GF.ABS:
                        if ai < 0 then
                            ci <- -ai
                        else
                            ci <- ai
                        endif
                    GF.NEG:
                        ci <- -ai
                    GF.SQR:
                        ci <- √ai
                endcase
                c[i+prec-1..i] <- PackF(prec, ci, round)
            endfor
        GF.SINK:
            for i <- 0 to 128-prec by prec
                ai <- F(prec, a[i+prec-1..i])
                c[i+prec-1..i] <- ai
            endfor
        GF.FLOAT:
            for i <- 0 to 128-prec by prec
                ai <- a[i+prec-1..i]
                c[i+prec-1..i] <- PackF(prec, ai, round)
            endfor
        GF.INFLATE:
            for i <- 0 to 64-prec by prec
                ai <- F(prec, a[i+prec-1..i])
                c[i+i+prec+prec-1..i+i] <- PackF(prec+prec, ai, round)
            endfor
        GF.DEFLATE:
            for i <- 0 to 128-prec by prec
                ai <- F(prec, a[i+prec-1..i])
                c[i/2+prec/2-1..i/2] <- PackF(prec/2, ai, round)
            endfor
    endcase
    REG[rc] <- c
enddef 
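
The lane-by-lane structure of the definition above can be illustrated with a short Python sketch of the square-root case only. It is a sketch under stated assumptions, not the processor's implementation: the function names (group_sqrt, pack_lanes, unpack_lanes) are hypothetical, lane 0 is assumed to occupy the low-order bits of the 128-bit register, only the default round-to-nearest behavior is modeled, and exception reporting is omitted (Python's math.sqrt raises ValueError on a negative lane).

```python
import math
import struct

# struct format characters for IEEE 754 half, single and double precision.
_LANE_FORMAT = {16: "e", 32: "f", 64: "d"}


def pack_lanes(values, prec):
    """Pack 128/prec floating-point values into one 128-bit integer (lane 0 lowest)."""
    fmt = f"<{len(values)}{_LANE_FORMAT[prec]}"
    return int.from_bytes(struct.pack(fmt, *values), "little")


def unpack_lanes(reg, prec):
    """Split a 128-bit integer into its 128/prec floating-point lanes."""
    fmt = f"<{128 // prec}{_LANE_FORMAT[prec]}"
    return struct.unpack(fmt, reg.to_bytes(16, "little"))


def group_sqrt(reg, prec):
    """Apply square root to every prec-bit lane of a 128-bit register image."""
    roots = [math.sqrt(v) for v in unpack_lanes(reg, prec)]
    return pack_lanes(roots, prec)
```

For example, four single-precision lanes holding 4.0, 9.0, 16.0 and 2.25 come back as 2.0, 3.0, 4.0 and 1.5, each in the lane position it started in.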

Exceptions

Reserved instruction 
Floating-point arithmetic 
Load

These operations add the contents of two registers to produce a virtual address,
load data from memory, sign- or zero-extending the data to fill the destination
register.



Operation codes

L.8²⁶           Load signed byte
L.16.B          Load signed doublet big-endian
L.16.B.A        Load signed doublet big-endian aligned
L.16.L          Load signed doublet little-endian
L.16.L.A        Load signed doublet little-endian aligned
L.32.B          Load signed quadlet big-endian
L.32.B.A        Load signed quadlet big-endian aligned
L.32.L          Load signed quadlet little-endian
L.32.L.A        Load signed quadlet little-endian aligned
L.64.B²⁷        Load octlet big-endian
L.64.B.A²⁸      Load octlet big-endian aligned
L.64.L²⁹        Load octlet little-endian
L.64.L.A³⁰      Load octlet little-endian aligned
L.128.B³¹       Load hexlet big-endian
L.128.B.A³²     Load hexlet big-endian aligned
L.128.L³³       Load hexlet little-endian
L.128.L.A³⁴     Load hexlet little-endian aligned
L.U8³⁵          Load unsigned byte
L.U.16.B        Load unsigned doublet big-endian
L.U.16.B.A      Load unsigned doublet big-endian aligned
L.U.16.L        Load unsigned doublet little-endian
L.U.16.L.A      Load unsigned doublet little-endian aligned
L.U.32.B        Load unsigned quadlet big-endian
L.U.32.B.A      Load unsigned quadlet big-endian aligned
L.U.32.L        Load unsigned quadlet little-endian
L.U.32.L.A      Load unsigned quadlet little-endian aligned
L.U.64.B        Load unsigned octlet big-endian
L.U.64.B.A      Load unsigned octlet big-endian aligned
L.U.64.L        Load unsigned octlet little-endian
L.U.64.L.A      Load unsigned octlet little-endian aligned

²⁶ L.8 need not distinguish between little-endian and big-endian ordering, nor between aligned
and unaligned, as only a single byte is loaded.

²⁷ L.64.B need not distinguish between signed and unsigned, as the octlet fills the destination
register.

²⁸ L.64.B.A need not distinguish between signed and unsigned, as the octlet fills the destination
register.

²⁹ L.64.L need not distinguish between signed and unsigned, as the octlet fills the destination
register.

³⁰ L.64.L.A need not distinguish between signed and unsigned, as the octlet fills the destination
register.

³¹ L.128.B need not distinguish between signed and unsigned, as the hexlet fills the destination
register pair.

³² L.128.B.A need not distinguish between signed and unsigned, as the hexlet fills the
destination register pair.

³³ L.128.L need not distinguish between signed and unsigned, as the hexlet fills the destination
register pair.

³⁴ L.128.L.A need not distinguish between signed and unsigned, as the hexlet fills the
destination register pair.

³⁵ L.U8 need not distinguish between little-endian and big-endian ordering, nor between aligned
and unaligned, as only a single byte is loaded.



number format               type    size        ordering    alignment
signed byte                         8
unsigned byte               U       8
signed integer                      16 32 64    L B
signed integer aligned              16 32 64    L B         A
unsigned integer            U       16 32 64    L B
unsigned integer aligned    U       16 32 64    L B         A
register                            128         L B
register aligned                    128         L B         A



Format

op rc=ra,rb

 31        24 23     18 17     12 11      6 5       0
|  L.MINOR   |    ra   |    rb   |    rc   |    op   |
      8           6         6         6         6
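
As an illustration of the field layout above, the following Python sketch splits a 32-bit instruction word into its five fields. The function name decode_fields and the dictionary keys are hypothetical labels chosen for this example; only the bit positions (31..24, 23..18, 17..12, 11..6, 5..0) are taken from the format diagram.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into the fields of the register-register format."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must fit in 32 bits")
    return {
        "major": (word >> 24) & 0xFF,  # bits 31..24, the L.MINOR field of the diagram
        "ra": (word >> 18) & 0x3F,     # bits 23..18
        "rb": (word >> 12) & 0x3F,     # bits 17..12
        "rc": (word >> 6) & 0x3F,      # bits 11..6
        "op": word & 0x3F,             # bits 5..0
    }
```

The same shifts and masks apply to the Store format, which uses the same field widths.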

Description 

A virtual address is computed from the sum of the contents of register ra and
register rb. The contents of memory using the specified byte order are treated as
the size specified and zero-extended or sign-extended as specified, and placed into
register rc.

If alignment is specified, the computed virtual address must be aligned, that is, it
must be an exact multiple of the size expressed in bytes. If the address is not
aligned, an "access disallowed by virtual address" exception occurs.

Definition 

def Load(op,ra,rb,rc) as
    case op of
        L16L, L32L, L8, L16LA, L32LA, L16B, L32B, L16BA, L32BA,
        L64L, L64LA, L64B, L64BA:
            signed <- true
        LU16L, LU32L, LU8, LU16LA, LU32LA, LU16B, LU32B, LU16BA, LU32BA,
        LU64L, LU64LA, LU64B, LU64BA:
            signed <- false
        L128L, L128LA, L128B, L128BA:
            signed <- undefined
    endcase
    case op of
        L8, LU8:
            size <- 8
        L16L, LU16L, L16LA, LU16LA, L16B, LU16B, L16BA, LU16BA:
            size <- 16
        L32L, LU32L, L32LA, LU32LA, L32B, LU32B, L32BA, LU32BA:
            size <- 32
        L64L, LU64L, L64LA, LU64LA, L64B, LU64B, L64BA, LU64BA:
            size <- 64
        L128L, L128LA, L128B, L128BA:
            size <- 128
    endcase
    case op of
        L16L, LU16L, L32L, LU32L, L64L, LU64L, L128L,
        L16LA, LU16LA, L32LA, LU32LA, L64LA, LU64LA, L128LA:
            order <- L
        L16B, LU16B, L32B, LU32B, L64B, LU64B, L128B,
        L16BA, LU16BA, L32BA, LU32BA, L64BA, LU64BA, L128BA:
            order <- B
        L8, LU8:
            order <- undefined
    endcase
    case op of
        L16L, LU16L, L32L, LU32L, L64L, LU64L, L128L,
        L16B, LU16B, L32B, LU32B, L64B, LU64B, L128B:
            align <- false
        L16LA, LU16LA, L32LA, LU32LA, L64LA, LU64LA, L128LA,
        L16BA, LU16BA, L32BA, LU32BA, L64BA, LU64BA, L128BA:
            align <- true
        L8, LU8:
            align <- undefined
    endcase
    a <- RegRead(ra, 64)
    b <- RegRead(rb, 64)
    VirtAddr <- a + b
    if align then
        if (VirtAddr and ((size/8)-1)) ≠ 0 then
            raise AccessDisallowedByVirtualAddress
        endif
    endif
    m <- LoadMemory(VirtAddr,size,order)
    mx <- (m[size-1] and signed)^(128-size) || m
    case size of
        8, 16, 32, 64:
            RegWrite(rc, 64, mx[63..0])
        128:
            RegWrite(rc, 128, mx)
    endcase
enddef 
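
The alignment check and the sign or zero extension in the definition above can be tried out with the following Python sketch. It is a simplified model under stated assumptions: memory is a flat bytes object indexed directly by the virtual address (no translation, tags, or TLB exceptions), only sizes up to 64 bits are handled, and the function name load and its parameters are hypothetical.

```python
MASK64 = (1 << 64) - 1


def load(memory, virt_addr, size, order, signed, align):
    """Return the 64-bit register image produced by loading `size` bits (8 to 64)."""
    nbytes = size // 8
    # An aligned load requires the address to be an exact multiple of the size in bytes.
    if align and (virt_addr & (nbytes - 1)) != 0:
        raise ValueError("access disallowed by virtual address")
    raw = memory[virt_addr:virt_addr + nbytes]
    byteorder = "little" if order == "L" else "big"
    # signed=True replicates the top bit of the loaded value (sign extension);
    # signed=False fills the upper bits with zeros (zero extension).
    value = int.from_bytes(raw, byteorder, signed=signed)
    return value & MASK64
```

With memory holding the bytes 80 01 FF 7F (hexadecimal), a signed big-endian doublet load at address 0 yields FFFFFFFFFFFF8001, while an unsigned little-endian doublet load at the same address yields 0180.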

Exceptions 

Reserved instruction 

Access disallowed by virtual address 

Access disallowed by tag 

Access disallowed by global TLB 
Access disallowed by local TLB

Access detail required by tag 

Access detail required by local TLB 

Access detail required by global TLB 

Cache coherence intervention required by tag 

Cache coherence intervention required by local TLB

Cache coherence intervention required by global TLB 

Local TLB miss 

Global TLB miss 
Load Immediate

These operations add the contents of a register to a sign-extended immediate
value to produce a virtual address, load data from memory, sign- or zero-extending
the data to fill the destination register.



Operation codes

L.8.I³⁶         Load signed byte immediate
L.16.B.A.I      Load signed doublet big-endian aligned immediate
L.16.B.I        Load signed doublet big-endian immediate
L.16.L.A.I      Load signed doublet little-endian aligned immediate
L.16.L.I        Load signed doublet little-endian immediate
L.32.B.A.I      Load signed quadlet big-endian aligned immediate
L.32.B.I        Load signed quadlet big-endian immediate
L.32.L.A.I      Load signed quadlet little-endian aligned immediate
L.32.L.I        Load signed quadlet little-endian immediate
L.64.B.A.I³⁷    Load octlet big-endian aligned immediate
L.64.B.I³⁸      Load octlet big-endian immediate
L.64.L.A.I³⁹    Load octlet little-endian aligned immediate
L.64.L.I⁴⁰      Load octlet little-endian immediate
L.128.B.A.I⁴¹   Load hexlet big-endian aligned immediate
L.128.B.I⁴²     Load hexlet big-endian immediate
L.128.L.A.I⁴³   Load hexlet little-endian aligned immediate
L.128.L.I⁴⁴     Load hexlet little-endian immediate
L.U8.I⁴⁵        Load unsigned byte immediate
L.U.16.B.A.I    Load unsigned doublet big-endian aligned immediate
L.U.16.B.I      Load unsigned doublet big-endian immediate
L.U.16.L.A.I    Load unsigned doublet little-endian aligned immediate
L.U.16.L.I      Load unsigned doublet little-endian immediate
L.U.32.B.A.I    Load unsigned quadlet big-endian aligned immediate
L.U.32.B.I      Load unsigned quadlet big-endian immediate
L.U.32.L.A.I    Load unsigned quadlet little-endian aligned immediate
L.U.32.L.I      Load unsigned quadlet little-endian immediate
L.U.64.B.A.I    Load unsigned octlet big-endian aligned immediate
L.U.64.B.I      Load unsigned octlet big-endian immediate
L.U.64.L.A.I    Load unsigned octlet little-endian aligned immediate
L.U.64.L.I      Load unsigned octlet little-endian immediate

³⁶ L.8.I need not distinguish between little-endian and big-endian ordering, nor between aligned
and unaligned, as only a single byte is loaded.

³⁷ L.64.B.A.I need not distinguish between signed and unsigned, as the octlet fills the
destination register.

³⁸ L.64.B.I need not distinguish between signed and unsigned, as the octlet fills the destination
register.

³⁹ L.64.L.A.I need not distinguish between signed and unsigned, as the octlet fills the
destination register.

⁴⁰ L.64.L.I need not distinguish between signed and unsigned, as the octlet fills the destination
register.

⁴¹ L.128.B.A.I need not distinguish between signed and unsigned, as the hexlet fills the
destination register pair.

⁴² L.128.B.I need not distinguish between signed and unsigned, as the hexlet fills the destination
register pair.

⁴³ L.128.L.A.I need not distinguish between signed and unsigned, as the hexlet fills the
destination register pair.

⁴⁴ L.128.L.I need not distinguish between signed and unsigned, as the hexlet fills the destination
register pair.

⁴⁵ L.U8.I need not distinguish between little-endian and big-endian ordering, nor between
aligned and unaligned, as only a single byte is loaded.



number format               type    size        ordering    alignment
signed byte                         8
unsigned byte               U       8
signed integer                      16 32 64    L B
signed integer aligned              16 32 64    L B         A
unsigned integer            U       16 32 64    L B
unsigned integer aligned    U       16 32 64    L B         A
register                            128         L B
register aligned                    128         L B         A



Format

op rb=ra,offset

 31        24 23     18 17     12 11                0
|     op     |    ra   |    rb   |       offset      |
      8           6         6             12

Description

A virtual address is computed from the sum of the contents of register ra and the
sign-extended value of the offset field. The contents of memory using the specified
byte order are treated as the size specified and zero-extended or sign-extended as
specified, and placed into register rb.

If alignment is specified, the computed virtual address must be aligned, that is, it
must be an exact multiple of the size expressed in bytes. If the address is not
aligned, an "access disallowed by virtual address" exception occurs.

Definition 

def LoadImmediate(op,ra,rb,offset) as
    case op of
        L16LI, L32LI, L8I, L16LAI, L32LAI, L16BI, L32BI, L16BAI, L32BAI,
        L64LI, L64LAI, L64BI, L64BAI:
            signed <- true
        LU16LI, LU32LI, LU8I, LU16LAI, LU32LAI,
        LU16BI, LU32BI, LU16BAI, LU32BAI,
        LU64LI, LU64LAI, LU64BI, LU64BAI:
            signed <- false
        L128LI, L128LAI, L128BI, L128BAI:
            signed <- undefined
    endcase
    case op of
        L8I, LU8I:
            size <- 8
        L16LI, LU16LI, L16LAI, LU16LAI, L16BI, LU16BI, L16BAI, LU16BAI:
            size <- 16
        L32LI, LU32LI, L32LAI, LU32LAI, L32BI, LU32BI, L32BAI, LU32BAI:
            size <- 32
        L64LI, LU64LI, L64LAI, LU64LAI, L64BI, LU64BI, L64BAI, LU64BAI:
            size <- 64
        L128LI, L128LAI, L128BI, L128BAI:
            size <- 128
    endcase
    case op of
        L16LI, LU16LI, L32LI, LU32LI, L64LI, LU64LI, L128LI,
        L16BI, LU16BI, L32BI, LU32BI, L64BI, LU64BI, L128BI:
            align <- false
        L16LAI, LU16LAI, L32LAI, LU32LAI, L64LAI, LU64LAI, L128LAI,
        L16BAI, LU16BAI, L32BAI, LU32BAI, L64BAI, LU64BAI, L128BAI:
            align <- true
        L8I, LU8I:
            align <- undefined
    endcase
    case op of
        L16LI, LU16LI, L32LI, LU32LI, L64LI, LU64LI, L128LI,
        L16LAI, LU16LAI, L32LAI, LU32LAI, L64LAI, LU64LAI, L128LAI:
            order <- L
        L16BI, LU16BI, L32BI, LU32BI, L64BI, LU64BI, L128BI,
        L16BAI, LU16BAI, L32BAI, LU32BAI, L64BAI, LU64BAI, L128BAI:
            order <- B
        L8I, LU8I:
            order <- undefined
    endcase
    a <- RegRead(ra, 64)
    VirtAddr <- a + (offset[11]^52 || offset)
    if align then
        if (VirtAddr and ((size/8)-1)) ≠ 0 then
            raise AccessDisallowedByVirtualAddress
        endif
    endif
    b <- LoadMemory(VirtAddr,size,order)
    bx <- (b[size-1] and signed)^(128-size) || b
    RegWrite(rb, 64, bx)
enddef
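
The address computation in the definition above replicates bit 11 of the 12-bit offset field 52 times to form a 64-bit signed value before adding it to the contents of register ra. A minimal Python sketch of that step follows; the helper names sign_extend and immediate_address are hypothetical, and wrapping the sum to 64 bits is an assumption of this example.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def immediate_address(ra_value, offset_field):
    """Add the sign-extended 12-bit offset field to the contents of register ra."""
    return (ra_value + sign_extend(offset_field, 12)) & MASK64
```

An offset field of FFF (hexadecimal) therefore addresses the byte just below the base, and 7FF is the largest forward displacement.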

Exceptions 

Reserved instruction 

Access disallowed by virtual address 

Access disallowed by tag 

Access disallowed by global TLB 

Access disallowed by local TLB 

Access detail required by tag 

Access detail required by local TLB 

Access detail required by global TLB 

Cache coherence intervention required by tag 
Cache coherence intervention required by local TLB
Cache coherence intervention required by global TLB
Local TLB miss 
Global TLB miss 






Store 

These operations add the contents of two registers to produce a virtual address,
and store the contents of a register into memory.



Operation codes

S.8⁴⁶           Store byte
S.16.B          Store doublet big-endian
S.16.B.A        Store doublet big-endian aligned
S.16.L          Store doublet little-endian
S.16.L.A        Store doublet little-endian aligned
S.32.B          Store quadlet big-endian
S.32.B.A        Store quadlet big-endian aligned
S.32.L          Store quadlet little-endian
S.32.L.A        Store quadlet little-endian aligned
S.64.B          Store octlet big-endian
S.64.B.A        Store octlet big-endian aligned
S.64.L          Store octlet little-endian
S.64.L.A        Store octlet little-endian aligned
S.128.B         Store hexlet big-endian
S.128.B.A       Store hexlet big-endian aligned
S.128.L         Store hexlet little-endian
S.128.L.A       Store hexlet little-endian aligned
S.AAS.64.B.A    Store add-and-swap octlet big-endian aligned
S.AAS.64.L.A    Store add-and-swap octlet little-endian aligned
S.CAS.64.B.A    Store compare-and-swap octlet big-endian aligned
S.CAS.64.L.A    Store compare-and-swap octlet little-endian aligned
S.MAS.64.B.A    Store multiplex-and-swap octlet big-endian aligned
S.MAS.64.L.A    Store multiplex-and-swap octlet little-endian aligned
S.MUX.64.B.A    Store multiplex octlet big-endian aligned
S.MUX.64.L.A    Store multiplex octlet little-endian aligned

size            ordering    alignment
8
16 32 64 128    L B
16 32 64 128    L B         A

⁴⁶ S.8 need not specify byte ordering, nor need it specify alignment checking, as it stores a single
byte.
Format

op ra,rb,rc

 31        24 23     18 17     12 11      6 5       0
|  S.MINOR   |    ra   |    rb   |    rc   |    op   |
      8           6         6         6         6

Description 

A virtual address is computed from the sum of the contents of register ra and 
register rb. The contents of register rc, treated as the size specified, is stored in 
memory using the specified byte order. 

If alignment is specified, the computed virtual address must be aligned, that is, it
must be an exact multiple of the size expressed in bytes. If the address is not
aligned, an "access disallowed by virtual address" exception occurs.

Definition 

def Store(op,ra,rb,rc) as
    case op of
        S8,
        S16L, S16LA, S16B, S16BA,
        S32L, S32LA, S32B, S32BA,
        S64L, S64LA, S64B, S64BA,
        S128L, S128LA, S128B, S128BA:
            function <- NONE
        SAAS64BA, SAAS64LA:
            function <- AAS
        SCAS64BA, SCAS64LA:
            function <- CAS
        SMAS64BA, SMAS64LA:
            function <- MAS
        SMUX64BA, SMUX64LA:
            function <- MUX
    endcase
    case op of
        S8:
            size <- 8
        S16L, S16LA, S16B, S16BA:
            size <- 16
        S32L, S32LA, S32B, S32BA:
            size <- 32
        S64L, S64LA, S64B, S64BA,
        SAAS64BA, SAAS64LA:
            size <- 64
        SCAS64BA, SCAS64LA, SMAS64BA, SMAS64LA, SMUX64BA, SMUX64LA:
            size <- 64
        S128L, S128LA, S128B, S128BA:
            size <- 128
    endcase
    case op of
        S8,
        S16L, S16LA, S16B, S16BA,
        S32L, S32LA, S32B, S32BA,
        S64L, S64LA, S64B, S64BA,
        SAAS64BA, SAAS64LA:
            rsize <- 64
        SCAS64BA, SCAS64LA, SMAS64BA, SMAS64LA, SMUX64BA, SMUX64LA:
            rsize <- 128
        S128L, S128LA, S128B, S128BA:
            rsize <- 128
    endcase
    case op of
        S8:
            align <- undefined
        S16L, S32L, S64L, S128L,
        S16B, S32B, S64B, S128B:
            align <- false
        S16LA, S32LA, S64LA, S128LA,
        S16BA, S32BA, S64BA, S128BA,
        SAAS64BA, SAAS64LA, SCAS64BA, SCAS64LA,
        SMAS64BA, SMAS64LA, SMUX64BA, SMUX64LA:
            align <- true
    endcase
    case op of
        S8:
            order <- undefined
        S16L, S32L, S64L, S128L,
        S16LA, S32LA, S64LA, S128LA,
        SAAS64LA, SCAS64LA, SMAS64LA, SMUX64LA:
            order <- L
        S16B, S32B, S64B, S128B,
        S16BA, S32BA, S64BA, S128BA,
        SAAS64BA, SCAS64BA, SMAS64BA, SMUX64BA:
            order <- B
    endcase
    a <- RegRead(ra, 64)
    b <- RegRead(rb, 64)
    VirtAddr <- a + b
    if align then
        if (VirtAddr and ((size/8)-1)) ≠ 0 then
            raise AccessDisallowedByVirtualAddress
        endif
    endif
    m <- RegRead(rc, rsize)
    case function of
        NONE:
            StoreMemory(VirtAddr,size,order,m[size-1..0])
        AAS:
            c <- LoadMemory(VirtAddr,size,order)
            StoreMemory(VirtAddr,size,order,m[63..0]+c)
            RegWrite(rc, 64, c)
        CAS:
            c <- LoadMemory(VirtAddr,size,order)
            if (c = m[63..0]) then
                StoreMemory(VirtAddr,size,order,m[127..64])
            endif
            RegWrite(rc, 64, c)
        MAS:
            c <- LoadMemory(VirtAddr,size,order)
            n <- (m[127..64] & m[63..0]) | (c & ~m[63..0])
            StoreMemory(VirtAddr,size,order,n)
            RegWrite(rc, 64, c)
        MUX:
            c <- LoadMemory(VirtAddr,size,order)
            n <- (m[127..64] & m[63..0]) | (c & ~m[63..0])
            StoreMemory(VirtAddr,size,order,n)
    endcase
enddef 
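
The four read-modify-write functions in the definition above differ only in how the new memory value is formed and whether the old value is returned to the register. The Python sketch below models just that step on 64-bit integers. It is illustrative only: the function name store_op is hypothetical, atomicity and memory access are not modeled, and truncating the add-and-swap sum to 64 bits is an assumption based on the 64-bit store size.

```python
MASK64 = (1 << 64) - 1


def store_op(function, old, m):
    """Return (new_memory_value, value_written_to_register_or_None).

    old is the octlet currently in memory; m is the register operand
    (64 bits for AAS, 128 bits for CAS, MAS and MUX).
    """
    low = m & MASK64            # m[63..0]
    high = (m >> 64) & MASK64   # m[127..64]
    if function == "AAS":       # add-and-swap: memory gets the sum, register gets the old value
        return (low + old) & MASK64, old
    if function == "CAS":       # compare-and-swap: store m[127..64] only if memory equals m[63..0]
        return (high if old == low else old), old
    if function in ("MAS", "MUX"):
        # m[63..0] acts as a bit mask selecting which bits come from m[127..64].
        new = (high & low) | (old & ~low & MASK64)
        return new, (old if function == "MAS" else None)
    raise ValueError("unknown store function: " + function)
```

For instance, a multiplex with mask 0F and data AA applied to a memory value of F0 leaves FA in memory; the multiplex-and-swap form additionally returns the old value F0.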

Exceptions 

Reserved instruction 

Access disallowed by virtual address 

Access disallowed by tag 

Access disallowed by global TLB 

Access disallowed by local TLB 

Access detail required by tag 

Access detail required by local TLB 

Access detail required by global TLB 

Cache coherence intervention required by tag 

Cache coherence intervention required by local TLB 

Cache coherence intervention required by global TLB

Local TLB miss 

Global TLB miss 
Store Immediate

These operations add the contents of a register to a sign-extended immediate
value to produce a virtual address, and store the contents of a register into
memory.



Operation codes

S.8.I⁴⁷           Store byte immediate
S.16.B.A.I        Store doublet big-endian aligned immediate
S.16.B.I          Store doublet big-endian immediate
S.16.L.A.I        Store doublet little-endian aligned immediate
S.16.L.I          Store doublet little-endian immediate
S.32.B.A.I        Store quadlet big-endian aligned immediate
S.32.B.I          Store quadlet big-endian immediate
S.32.L.A.I        Store quadlet little-endian aligned immediate
S.32.L.I          Store quadlet little-endian immediate
S.64.B.A.I        Store octlet big-endian aligned immediate
S.64.B.I          Store octlet big-endian immediate
S.64.L.A.I        Store octlet little-endian aligned immediate
S.64.L.I          Store octlet little-endian immediate
S.128.B.A.I       Store hexlet big-endian aligned immediate
S.128.B.I         Store hexlet big-endian immediate
S.128.L.A.I       Store hexlet little-endian aligned immediate
S.128.L.I         Store hexlet little-endian immediate
S.AAS.64.B.A.I    Store add-and-swap octlet big-endian aligned immediate
S.AAS.64.L.A.I    Store add-and-swap octlet little-endian aligned immediate
S.CAS.64.B.A.I    Store compare-and-swap octlet big-endian aligned immediate
S.CAS.64.L.A.I    Store compare-and-swap octlet little-endian aligned immediate
S.MAS.64.B.A.I    Store multiplex-and-swap octlet big-endian aligned immediate
S.MAS.64.L.A.I    Store multiplex-and-swap octlet little-endian aligned immediate
S.MUX.64.B.A.I    Store multiplex octlet big-endian aligned immediate
S.MUX.64.L.A.I    Store multiplex octlet little-endian aligned immediate

size            ordering    alignment
8
16 32 64 128    L B
16 32 64 128    L B         A

⁴⁷ S.8.I need not specify byte ordering, nor need it specify alignment checking, as it stores a
single byte.
Format

S.size.order.align.I ra,rb,offset

 31        24 23     18 17     12 11                0
|     op     |    ra   |    rb   |       offset      |
      8           6         6             12

Description

A virtual address is computed from the sum of the contents of register ra and the
sign-extended value of the offset field. The contents of register rb, treated as the
size specified, is stored in memory using the specified byte order.

If alignment is specified, the computed virtual address must be aligned, that is, it
must be an exact multiple of the size expressed in bytes. If the address is not
aligned, an "access disallowed by virtual address" exception occurs.

Definition 

def StoreImmediate(op,ra,rb,offset) as
    case op of
        S8I,
        S16LI, S16LAI, S16BI, S16BAI,
        S32LI, S32LAI, S32BI, S32BAI,
        S64LI, S64LAI, S64BI, S64BAI,
        S128LI, S128LAI, S128BI, S128BAI:
            function <- NONE
        SAAS64BAI, SAAS64LAI:
            function <- AAS
        SCAS64BAI, SCAS64LAI:
            function <- CAS
        SMAS64BAI, SMAS64LAI:
            function <- MAS
        SMUX64BAI, SMUX64LAI:
            function <- MUX
    endcase
    case op of
        S8I:
            size <- 8
        S16LI, S16LAI, S16BI, S16BAI:
            size <- 16
        S32LI, S32LAI, S32BI, S32BAI:
            size <- 32
        S64LI, S64LAI, S64BI, S64BAI, SAAS64BAI, SAAS64LAI,
        SCAS64BAI, SCAS64LAI, SMAS64BAI, SMAS64LAI, SMUX64BAI, SMUX64LAI:
            size <- 64
        S128LI, S128LAI, S128BI, S128BAI:
            size <- 128
    endcase
    case op of
        S8I,
        S16LI, S16LAI, S16BI, S16BAI,
        S32LI, S32LAI, S32BI, S32BAI,
        S64LI, S64LAI, S64BI, S64BAI,
        SAAS64BAI, SAAS64LAI:
            rsize <- 64
        SCAS64BAI, SCAS64LAI, SMAS64BAI, SMAS64LAI, SMUX64BAI, SMUX64LAI:
            rsize <- 128
        S128LI, S128LAI, S128BI, S128BAI:
            rsize <- 128
    endcase
    case op of
        S8I:
            align <- undefined
        S16LI, S32LI, S64LI, S128LI,
        S16BI, S32BI, S64BI, S128BI:
            align <- false
        S16LAI, S32LAI, S64LAI, S128LAI,
        S16BAI, S32BAI, S64BAI, S128BAI,
        SAAS64BAI, SAAS64LAI, SCAS64BAI, SCAS64LAI,
        SMAS64BAI, SMAS64LAI, SMUX64BAI, SMUX64LAI:
            align <- true
    endcase
    case op of
        S8I:
            order <- undefined
        S16LI, S32LI, S64LI, S128LI,
        S16LAI, S32LAI, S64LAI, S128LAI,
        SAAS64LAI, SCAS64LAI, SMAS64LAI, SMUX64LAI:
            order <- L
        S16BI, S32BI, S64BI, S128BI,
        S16BAI, S32BAI, S64BAI, S128BAI,
        SAAS64BAI, SCAS64BAI, SMAS64BAI, SMUX64BAI:
            order <- B
    endcase
    a <- RegRead(ra, 64)
    VirtAddr <- a + (offset[11]^52 || offset)
    if align then
        if (VirtAddr and ((size/8)-1)) ≠ 0 then
            raise AccessDisallowedByVirtualAddress
        endif
    endif
    m <- RegRead(rb, rsize)
    case function of
        NONE:
            StoreMemory(VirtAddr,size,order,m[size-1..0])
        AAS:
            b <- LoadMemory(VirtAddr,size,order)
            StoreMemory(VirtAddr,size,order,m[63..0]+b)
            RegWrite(rb, 64, b)
        CAS:
            b <- LoadMemory(VirtAddr,size,order)
            if (b = m[63..0]) then
                StoreMemory(VirtAddr,size,order,m[127..64])
            endif
            RegWrite(rb, 64, b)
        MAS:
            b <- LoadMemory(VirtAddr,size,order)
            n <- (m[127..64] & m[63..0]) | (b & ~m[63..0])
            StoreMemory(VirtAddr,size,order,n)
            RegWrite(rb, 64, b)
        MUX:
            b <- LoadMemory(VirtAddr,size,order)
            n <- (m[127..64] & m[63..0]) | (b & ~m[63..0])
            StoreMemory(VirtAddr,size,order,n)
    endcase
enddef 

Exceptions 

Reserved instruction

Access disallowed by virtual address 

Access disallowed by tag 

Access disallowed by global TLB 

Access disallowed by local TLB 

Access detail required by tag 

Access detail required by local TLB 

Access detail required by global TLB 

Cache coherence intervention required by tag 

Cache coherence intervention required by local TLB 

Cache coherence intervention required by global TLB 

Local TLB miss 

Global TLB miss 
Memory Management

This section discusses the caches, the translation mechanisms, the memory
interfaces, and how the multiprocessor interface is used to maintain cache
coherence.

The Terpsichore processor provides for both local and global virtual addressing,
arbitrary page sizes, and coherent-cache multiprocessors. The memory
management system is designed to provide the requirements for implementation
of virtual machines as well as virtual memory.

All facilities of the memory management system are themselves memory mapped,
in order to provide for the manipulation of these facilities by high-level language,
compiled code.

The translation mechanism is designed to allow full byte-at-a-time control of
access to the virtual address space, with the assistance of fast exception handlers.

Privilege levels provide for the secure transition between insecure user code and
secure system facilities. Instructions execute at a privilege, specified by a two-bit
field in the access information. Zero is the least-privileged level and three is the
most-privileged level.

The diagram below sketches the organization of the memory management system:

[Figure: Terpsichore memory management. Blocks: local virtual address; local
virtual to global virtual address translation; cache data; cache tag; global virtual
address; global virtual to physical address translation; physical address. Outputs:
data, local protection, global protection, hit.]

Starting from a local virtual address, the memory management system performs
three actions in parallel: the low-order bits of the virtual address are used to
directly access the data in the cache, a low-order bit field is used to access the
cache tag, and the high-order bits of the virtual address are translated from a local
address space to a global virtual address space.

Following these three actions, operations vary depending upon the cache
implementation. The cache tag may contain either a physical address and access
control information (a physically-tagged cache), or may contain a global virtual
address and global protection information (a virtually-tagged cache).

For a physically-tagged cache, the global virtual address is translated to a physical
address by the TLB, which generates global protection information. The cache
tag is checked against the physical address, to determine a cache hit. In parallel,
the local and global protection information is checked.

For a virtually-tagged cache, the cache tag is checked against the global virtual
address to determine a cache hit, and the local and global protection information
is checked. If the cache misses, the global virtual address is translated to a
physical address by the TLB, which also generates the global protection
information.

Local and Global Virtual Addresses

The 64-bit global virtual address space is global among all tasks. In a multitask
environment, requirements for a task-local address space arise from operations
such as the UNIX "fork" function, in which a task is duplicated into parent and
child tasks, each now having a unique virtual address space. In addition, when
switching tasks, access to one task's address space must be disabled and another
task's access enabled.

Terpsichore provides for portions of the address space to be made local to
individual tasks, with a translation to the global virtual space specified by four
16-bit registers for each local virtual space. Terpsichore specifies four sets of virtual
spaces, and therefore four sets of these four registers. The registers specify a mask
selecting which of the high-order 16 address bits are checked to match a
particular value, and if they match, a value with which to modify the virtual
address. Terpsichore avoids setting a fixed page size or local address size; these
can be set by software conventions.

A local virtual address space is specified by the following:



field name       size    description
local mask       16      mask to select fields of local virtual address to perform match over
local match      16      value to perform match with masked local virtual address
local xor        16      value to xor with local virtual address if matched
local protect    16      local protection field (detailed later)
These 16-bit registers are packed together into a 64-bit register as follows:

Local Translation Lookaside Buffer

 63        48 47        32 31        16 15          0
| mask[t][i] | match[t][i] |  xor[t][i] | protect[t][i] |
      16           16           16            16

The LTLB contains a separate context of register sets for each thread. A context
consists of one or more sets of mask/match/xor/protect registers, one set for each
simultaneously accessible local virtual address range. This set of registers is called
the "Local TLB context," or LTLB (Local Translation Lookaside Buffer) context.
The effect of this mechanism is to provide the facilities normally attributed to
segmentation. However, in this system there is no extension of the address range;
instead, segments are local nicknames for portions of the global virtual address
space.

For instructions executing at the two least privileged levels (level 0 or level 1), a
failure to match an LTLB entry causes an exception. This exception may be
handled by loading an LTLB entry and continuing execution, thus providing
access to an arbitrary number of local virtual address ranges.

Instructions executing at the two most privileged levels (level 2 or level 3) may
access any region in the local virtual address space when an LTLB entry matches,
and may access regions in the global virtual address space when no LTLB entry
matches. This mechanism permits privileged code to make judicious use of local
virtual address ranges, which simplifies the manner in which privileged code may
manipulate the contents of a local virtual address range on behalf of a less-
privileged client.

A minimum implementation of an LTLB context is a single set of
mask/match/xor/protect registers per thread. A single-set LTLB context may be
further simplified by reserving the implementation of the mask and match
registers, setting them to a read-only zero value:

 63                      32 31        16 15          0
|            0             |  xor[t][i] | protect[t][i] |
             32                 16            16

If the largest possible space is reserved for an address space identifier, the virtual
address is partitioned as shown below. Any of the bits marked as "local" below
may be used as "offset" as desired.

 63        48 47                                  0
|   local    |               offset               |
      16                       48

Definition 

def GlobalVA,LocalProtect <- LocalVirtualToGlobalVirtualAddressTranslation(th,va,pl) as
    LocalTLBMatch <- NONE
    for i <- 0 to <<sets per thread>>-1
        if (va[63..48] & ~LocalTLB[th][i][63..48]) = LocalTLB[th][i][47..32] then
            LocalTLBMatch <- i
        endif
    endfor
    if LocalTLBMatch = NONE then
        if pl < 2 then
            raise LocalTLBMiss
        endif
        GlobalVA <- va
        LocalProtect <- 0
    else
        GlobalVA <- (va[63..48] ^ LocalTLB[th][LocalTLBMatch][31..16]) || va[47..0]
        LocalProtect <- LocalTLB[th][LocalTLBMatch][15..0]
    endif
enddef 
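
The definition above can be exercised with a small executable sketch. The Python below is illustrative only: it assumes each LTLB entry is packed into one 64-bit integer as mask (bits 63..48), match (47..32), xor (31..16) and protect (15..0), and the names local_to_global and LocalTLBMiss are hypothetical, not part of the architecture.

```python
from typing import List, Optional, Tuple

MASK16 = 0xFFFF
LOW48 = (1 << 48) - 1


class LocalTLBMiss(Exception):
    """Raised when no LTLB entry matches at privilege level 0 or 1."""


def local_to_global(entries: List[int], va: int, pl: int) -> Tuple[int, int]:
    """Return (global virtual address, local protect) for one thread.

    Assumed entry packing: mask 63..48, match 47..32, xor 31..16,
    protect 15..0.
    """
    hi = (va >> 48) & MASK16
    matched: Optional[int] = None
    for entry in entries:
        mask = (entry >> 48) & MASK16
        match = (entry >> 32) & MASK16
        if (hi & ~mask & MASK16) == match:
            matched = entry  # like the definition, the last matching set wins
    if matched is None:
        if pl < 2:
            raise LocalTLBMiss(hex(va))
        return va, 0  # privileged code falls through to the global space
    xor = (matched >> 16) & MASK16
    protect = matched & MASK16
    return ((hi ^ xor) << 48) | (va & LOW48), protect
```

With a single-set context whose mask and match read as zero, only addresses whose upper 16 bits are zero match, and the xor field supplies the upper 16 bits of the global address.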

Global Virtual Cache 

The innermost levels of the instruction and data caches are direct-mapped and
indexed and matched entirely by the global virtual address. Consequently, each
block of memory data is tagged with access control information and the high-order
bits of the global virtual address. The current size of the virtual caches is 32
kilobytes; for architectural compatibility, a minimum size of 8 kilobytes and a
maximum size of 1 megabyte is specified. The mapping of virtual addresses to
physical may freely contain aliases, however, provided that either the associated
regions of memory are maintained as coherent, or that the low order 20 bits of any
virtual cache aliases are identical. (20 bits reflects the size of the maximum 1M
byte virtual cache.)

A virtual cache tag is contained in the buffer memory (described below). It is 
accessed in parallel with the virtual cache. The virtual tag must match, and the 
control information must permit the access, or a cache miss or exception occurs. 
There is one tag for each cache block: a cache block consists of 64 bytes, so for a 
32 kilobyte cache, there is 4 kilobytes of cache tag information for each cache. The 
protect field shown below is the concatenation of the access, state and control 
fields shown in the table below: 

 63                    13 12        0
|   global virtual tag   |  protect  |
            51                 13
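
As a quick check of the tag-storage figure quoted above, the arithmetic can be sketched in Python. This is a minimal sketch under stated assumptions: one 64-bit (8-byte) tag per 64-byte block, as the text and diagram indicate; the function name tag_storage_bytes is hypothetical.

```python
BLOCK_BYTES = 64  # cache block size stated in the text
TAG_BYTES = 8     # one 64-bit tag per cache block


def tag_storage_bytes(cache_bytes: int) -> int:
    """Bytes of tag storage needed for one virtual cache of the given size."""
    if cache_bytes <= 0 or cache_bytes % BLOCK_BYTES:
        raise ValueError("cache size must be a positive multiple of 64 bytes")
    return (cache_bytes // BLOCK_BYTES) * TAG_BYTES
```

For the current 32 kilobyte cache this gives 512 blocks and 4 kilobytes of tags; for the 1 megabyte maximum it gives 128 kilobytes.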

Definition 

The following function reads the data, tag, and protection bits from either the
instruction (c=0) or data (c=1) cache, given a local virtual address.

def data,GlobalVA,GlobalProtect ← ReadCache(c,va,size) as
    data ← cacheDataArray[c][va14..4]
    GlobalVA ← cacheTagArray[c][va14..6]63..16 || va15..0
    GlobalProtect ← cacheTagArray[c][va14..6]15..0
enddef

Translation and Protection

Global virtual addresses are translated to physical addresses only upon misses in 
the virtual caches. The translation is performed by software-programmable 
routines, augmented by a hardware TLB, specifically, the global TLB. The global 
TLB labels a cache line with the physical and access information in the virtual 
cache tag. The global TLB contains a minimum of 64 entries and a maximum of
256 entries.

A local TLB, global TLB or virtual cache entry contains the following information.
The figures in parentheses are the actual size of the field contained, if only a sub-
field is held in the entry.

field name           | size | description                              | local TLB | global TLB | cache tag
---------------------+------+------------------------------------------+-----------+------------+----------
virtual address      | 64   | virtual address (lowest address of       | ✓ (16)    | ✓ (58)     | ✓ (50)
                     |      | region)                                  |           |            |
virtual address mask | 64   | mask for virtual address match           |           | ✓ (58)     |
physical address     | 64   | physical address xor virtual address     | ✓ (16)    | ✓ (58)     |
reserved             |      | additional space in register             |           | ✓ (2)      | ✓ (1)
caching control      | 2    | are accesses to this region incoherent   | ✓         | ✓          | ✓ (1)
                     |      | (0), coherent (1), no-allocate (2), or   |           |            |
                     |      | uncached-physical (3)?                   |           |            |
detail access        | 1    | do portions of this region have access   | ✓         | ✓          | ✓
                     |      | controlled more restrictively?           |           |            |
access ordering      | 1    | are accesses to this region ordered      | ✓         | ✓          | ✓
                     |      | weakly (0) or strongly (1) with respect  |           |            |
                     |      | to other accesses?                       |           |            |
coherence state      | 3    | if region is coherently cached, does     | ✓         | ✓          |
                     |      | coherence state permit read (4), write   |           |            |
                     |      | (2), or replacement (1)? if region is    |           |            |
                     |      | not coherently cached, does cache state  |           |            |
                     |      | require write-back (2) or not (0)?       |           |            |
read access          | 2    | minimum privilege for read access        | ✓         | ✓          | ✓
write access         | 2    | minimum privilege for write access       | ✓         | ✓          | ✓
execute access       | 2    | minimum privilege for execute access     | ✓         | ✓          | ✓
gateway access       | 2    | minimum privilege for gateway access     | ✓         | ✓          | ✓

information in local TLB, global TLB, or virtual cache entry

The bottom section of the table above indicates the contents of the 16-bit
protection field:

[Figure: bit-field layouts of the protection information (cc, da, ao, cs and the
access privilege fields); bit positions not recoverable from the source.]



Memory Interface 



Dedicated hardware mechanisms are provided to fetch and store back data blocks
in the instruction and data caches, provided that a matching entry can be found in
the global TLB. When no entry is to be found in the global TLB, an exception
handler is invoked either to generate the required information from the virtual
address, or to place an entry in the global TLB to provide for automatic handling
of this and other similarly addressed data blocks.

The initial implementation of Terpsichore partitions the remainder of the local
memory system, including a second-level joint physical cache and a DRAM-based
memory array, into a set of separate devices, called Mnemosyne. These devices
are accessed via a high-bandwidth, byte-wide packet interface, called Hermes,
which is largely transparent to the Terpsichore architecture. The Mnemosyne
devices provide single-bit correction, double-bit detection ECC on all local
memory accesses, and a check byte protects all packet interface transfers. The
[...] implementation-dependent.

Cache Coherence

The Terpsichore processor is intended for use in either a uniprocessor or
multiprocessor environment. At the high performance level intended for [...]
processor designs to maintain cache [...] be effectively built, because the
[...] bandwidth between processors would be excessive.
Several cache coherence mechanisms have been designed for high-performance
processors that do not require that all memory transactions be broadcast among
the processors in a system, one of which is the Scalable Coherent Interface, or
SCI, as specified by the IEEE Standard 1596-1992.

[...] cache coherence mechanism is extremely complex. Many of the cache
coherence operations take time proportional to the number of processors involved
in the operation, and so implicitly assume that the number of processors sharing a
[...] considering an even more complex mechanism
that may assure logarithmic growth in time complexity). Most importantly, no
complete working prototype of this mechanism has been built, tested and
benchmarked at the time of development of the Terpsichore architecture.

[...] cache coherence mechanism should not be
implemented in immutable, hardware state machines. A software implementation
of a cache coherence protocol is proposed, which given the high intrinsic
performance of the processor is likely to reach nearly the performance level that
[...] performance of
experimental non-bus-oriented cache coherency operations.

Cache coherence information is available within the local TLB, global TLB and
[...] coherence operations may be performed at task-level, page-level, or
cache-block level as desired. This flexibility provides for a coherent view
of memory in multiprocessing systems with varying degrees of coupling.

Physical Addresses

A physical address consists of a 16 bit processor node number and a 48 bit address.

 63      48 47                    0
|   node   |     LocalAddress     |
     16               48

Physical addresses in which the node number is zero reference the local
processor's local memory space, providing access to local memory, cache tags,
system and interface facilities. Physical addresses in which the node number is
nonzero reference other processors' local memory spaces using the Hydra
interface for communication.
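
The node/local split described above can be sketched in a few lines of Python. This is an illustrative sketch only; the helper names split_physical and is_local are hypothetical.

```python
from typing import Tuple

LOCAL_BITS = 48
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def split_physical(pa: int) -> Tuple[int, int]:
    """Split a 64-bit physical address into (node number, local address)."""
    node = (pa >> LOCAL_BITS) & 0xFFFF
    return node, pa & LOCAL_MASK


def is_local(pa: int) -> bool:
    """A node number of zero references the local processor's memory space."""
    return split_physical(pa)[0] == 0
```

A nonzero node number would instead select another processor's memory space, reached over the Hydra interface.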

[...] Terpsichore involves the use of up to twelve
Hermes byte-wide packet communications channels, by which Terpsichore can
issue read or write transactions to Mnemosyne, Calliope, and Hydra devices. In
addition, Terpsichore can issue read or write transactions to the Cerberus serial
bus interface, via which the Mnemosyne, Calliope, Hydra and other device
configuration and control registers can be accessed. The diagram shows
one possible Terpsichore memory environment:

[Figure: Terpsichore processor with Hermes and Cerberus interfaces]


Hermes channels 0..7 are always used to connect Terpsichore to Mnemosyne
memory devices. Hermes channels 8..11 may be used to connect Mnemosyne,
Calliope, or Hydra devices.

Terpsichore provides three different mappings of the local memory environment
in the local physical address space. The non-interleaved space provides for the
access of all Mnemosyne, Calliope, and Hydra device memory spaces such that
each device appears as a single continuous space. The uniprocessor spaces
provide for the interleaved access of one, two, or four sets of eight Mnemosyne
devices on separate Hermes channels as a single continuous space. The
multiprocessor spaces provide for the interleaved access of one, two, or four sets
of nine Mnemosyne devices on separate Hermes channels as a single continuous
space, with the ninth channel used as a cache coherency directory.

 63      48 47     40 39                    0
|   node   |  space  |  SpaceLocalAddress   |
     16         8               40

The value of the space field determines the interpretation of the 48-bit local
address field as given by the following table:

value    | interpretation
---------+------------------------------------------------
0        | non-interleaved Hermes channel 0..7 space
1        | non-interleaved Hermes channel 8..11 space
2        | 8x1-way interleaved uniprocessor memory space
3        | 9x1-way interleaved multiprocessor memory space
4        | 8x2-way interleaved uniprocessor memory space
5        | 9x2-way interleaved multiprocessor memory space
6        | 8x4-way interleaved uniprocessor memory space
7        | 9x4-way interleaved multiprocessor memory space
8        |
9        |
10       | 4x1-way interleaved uniprocessor memory space
11       | 5x1-way interleaved multiprocessor memory space
12       | 4x2-way interleaved uniprocessor memory space
13       | 5x2-way interleaved multiprocessor memory space
14       | 4x4-way interleaved uniprocessor memory space
15       | 5x4-way interleaved multiprocessor memory space
16       |
17       |
18       | 2x1-way interleaved uniprocessor memory space
19       | 3x1-way interleaved multiprocessor memory space
20       | 2x2-way interleaved uniprocessor memory space
21       | 3x2-way interleaved multiprocessor memory space
22       | 2x4-way interleaved uniprocessor memory space
23       | 3x4-way interleaved multiprocessor memory space
24..31   | Cerberus memory and control space
32..255  |



The non-interleaved Hermes channel 0..7 space provides a single continuous
memory space for each device in Hermes channels 0..7. Mnemosyne protocols are
used. Only incoherent accesses are supported (no memory directory tags).

 47     40 39  37 36 35 34          3 2    0
|   s=0   |  c   |  m  |    addr     |  b   |
     8       3      2        32         3

The range of valid values and the interpretation of the fields is given by the
following table:

field | value     | interpretation
------+-----------+----------------------------------------------------
s     | 0         | Specify non-interleaved Hermes channel 0..7 space
c     | 0..7      | Hermes channels 0..7
m     | 0..3      | Module address
addr  | 0..2^32-1 | Logical memory block address
b     | 0..7      | Pad for conversion of byte address to block address


The non-interleaved Hermes channel 8..11 space provides a single continuous
memory space for each device in Hermes channels 8..11. Either Mnemosyne or
Calliope/Hydra protocols may be specified. Only incoherent accesses are
supported (no memory directory tags).

 47     40  39  38 37 36 35 34          3 2    0
|   s=1   | h  |  c  |  m  |    addr     |  b   |
     8      1     2     2        32         3

field | value     | interpretation
------+-----------+----------------------------------------------------
s     | 1         | Specify non-interleaved Hermes channel 8..11 space
h     | 0..1      | 0: use Mnemosyne protocol
      |           | 1: use Calliope/Hydra protocol
c     | 0..3      | Hermes channels 8..11
m     | 0..3      | Module address
addr  | 0..2^32-1 | Logical memory block address
b     | 0..7      | Pad for conversion of byte address to block address

The interleaved spaces described below interleave between 8 Hermes channels
(0..7), supporting only incoherent accesses (no memory directory tags).
Mnemosyne protocols are used.



 47     40 39 38 37          6 5    3 2    0
|   s=2   |  n  |    addr     |  c   |  b   |
     8       2        32         3      3

 47     40  39  38          7  6   5    3 2    0
|   s=4   | n  |    addr     | m  |  c   |  b   |
     8      1        32        1     3      3

 47     40 39          8 7  6 5    3 2    0
|   s=6   |    addr     |  m  |  c   |  b   |
     8          32         2     3      3


The interleaved spaces described below interleave between 4 Hermes channels
[...], supporting only incoherent accesses (no memory directory tags).
Mnemosyne protocols are used.

 47     40  39  38 37 36          5 4  3 2    0
|  s=10   | d  |  n  |    addr     |  c  |  b   |
     8      1     2        32         2     3

 47     40  39   38  37          6  5  4  3 2    0
|  s=12   | d  | n  |    addr     | m  |  c  |  b   |
     8      1    1        32        1     2     3

 47     40  39  38          7 6  5 4  3 2    0
|  s=14   | d  |    addr     |  m  |  c  |  b   |
     8      1        32         2     2     3



The interleaved spaces described below interleave between 2 Hermes channels
[...], supporting only incoherent accesses (no memory directory
tags). Mnemosyne protocols are used.

 47     40 39 38 37 36 35          4  3  2    0
|  s=18   |  d  |  n  |    addr     | c  |  b   |
     8       2     2        32        1     3

 47     40 39 38  37  36          5  4   3  2    0
|  s=20   |  d  | n  |    addr     | m  | c  |  b   |
     8       2    1        32        1    1     3

 47     40 39 38 37          6 5  4  3  2    0
|  s=22   |  d  |    addr     |  m  | c  |  b   |
     8       2        32         2    1     3

The range of valid values and the interpretation of the fields is given by the
following table:

field | value              | interpretation
------+--------------------+---------------------------------------------
s     | 12, 14, 18, 20, 22 | Specify uniprocessor interleaved space
d     | 0..3               | High-order bits of Hermes channel number
c     | 0..7               | Low-order bits of Hermes channel number
n     | 0..3               | High-order bits of Hermes module address
m     | 0..3               | Low-order bits of Hermes module address
addr  | 0..2^32-1          | Logical memory block address
b     | 0..7               | Pad for conversion of byte to block address

interleaved space field interpretation

The Hermes channel number is constructed by concatenating the d and c fields:

|  c  |   or   | d |  c  |   or   |  d  | c |
   3             1    2              2    1

The Hermes module number is constructed by concatenating the n and m fields:

|  n  |   or   | n | m |   or   |  m  |
   2             1   1             2

Multiprocessor Interleaved Spaces

The interleaved spaces described below interleave between 9 channels for
multiprocessor.

 47     40 39 38 37          6 5    3 2    0
|   s=3   |  n  |    addr     |  c   |  b   |
     8       2        32         3      3

 47     40  39  38          7  6   5    3 2    0
|   s=5   | n  |    addr     | m  |  c   |  b   |
     8      1        32        1     3      3

 47     40 39          8 7  6 5    3 2    0
|   s=7   |    addr     |  m  |  c   |  b   |
     8          32         2     3      3




WO 97/07450 PCT/US96/13047



The range of valid values and the interpretation of the fields is given by the
following table:

field | value     | interpretation
------+-----------+---------------------------------------------
s     | 3, 5, 7   | Specify interleaved space
c     | 0..7      | Mnemosyne channels 0..7, before modification
      |           | described below
n     | 0..3      | High-order bits of Hermes module address
m     | 0..3      | Low-order bits of Hermes module address
addr  | 0..2^32-1 | Logical memory block address
b     | 0..7      | Pad for conversion of byte to block address


For the multiprocessor space, the channel number field is modified by the low-
order memory block address bits according to the following tables. In addition,
[...] Hermes channel specified in the tables below.

addr2..0 |            c            | tag
         | 0  1  2  3  4  5  6  7  |
---------+-------------------------+-----
   0     | 8  1  2  3  4  5  6  7  |  0
   1     | 4  5  6  7  0  8  2  3  |
   2     | 8  3  0  1  6  7  4  5  |  2
   3     | 6  7  4  5  2  8  0  1  |
   4     | 1  0  3  2  5  8  7  6  |
   5     | 8  4  7  6  1  0  3  2  |
   6     | 3  2  1  0  7  8  5  4  |
   7     | 8  6  5  4  3  2  1  0  |

The memory tag entry is an octlet value for each 64-byte memory block. The
contents of the tag is interpreted by Terpsichore hardware to signify: (0) a zero
value indicates that the memory block is not contained in the cache, (1) a value
equal to the virtual address used to access the memory block indicates that the
value is cached at that address, and (2) any other value indicates that the value
may be cached in multiple or remote locations and requires software intervention
for interpretation.

Thus a read to a memory block accesses the tag, and if the value is zero, fills it with
the virtual address via which the access occurred. When the memory block is
returned to memory, the tag is accessed, and if the value is equal to the virtual
[...] an exception occurs, which is
handled by software to implement the cache coherency mechanisms.

We also need to have a space available in which access to the tag via software
routines is straightforward - the non-interleaved space makes the tag available
but not conveniently.
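
The three-way interpretation of the memory tag described above can be sketched as follows. This Python is an illustrative model only: the names TagState, classify_tag and fill_on_read are hypothetical, the tag store is modelled as a dictionary keyed by block number, and the behaviour of a read that finds a nonzero tag is left to the caller because the text does not spell it out.

```python
from enum import Enum
from typing import Dict


class TagState(Enum):
    NOT_CACHED = 0      # tag is zero
    CACHED_AT_VA = 1    # tag equals the accessing virtual address
    NEEDS_SOFTWARE = 2  # any other value: software must interpret it


def classify_tag(tag: int, va: int) -> TagState:
    """Interpret a 64-bit memory tag relative to the accessing address."""
    if tag == 0:
        return TagState.NOT_CACHED
    if tag == va:
        return TagState.CACHED_AT_VA
    return TagState.NEEDS_SOFTWARE


def fill_on_read(tags: Dict[int, int], block: int, va: int) -> TagState:
    """On a read, a zero tag is filled with the accessing virtual address.

    Returns the state observed before any fill.
    """
    state = classify_tag(tags.get(block, 0), va)
    if state is TagState.NOT_CACHED:
        tags[block] = va
    return state
```

A second read of the same block from the same virtual address then classifies as CACHED_AT_VA, while a read from a different address classifies as NEEDS_SOFTWARE.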



OS 

The Cerberus serial bus space provides access to a memory space in which
Bootstrap ROM code, Terpsichore, Mnemosyne and Calliope configuration data,
and other Cerberus peripherals are accessed. The Cerberus serial bus is specified
by the document: "Cerberus Serial Bus Architecture." Terpsichore configuration
data is accessible via Cerberus as a slave device as well as via this address space.

 47   43 42      27 26    19 18       3 2    0
|   3   |   net    |  node  |   addr   |  b   |
    5       16         8         16       3

The range of valid values and the interpretation of the fields is given by the
following table:

field | value  | interpretation
------+--------+----------------------------------------------------
s     | 3      | Specify Cerberus space
net   |        | Specify Cerberus net address
node  | 0..255 | Cerberus node address
addr  |        | Logical memory block address
b     | 0..7   | Pad for conversion of byte address to block address

Cerberus space field interpretation



Control Register Addresses

This section is under construction. 

Local TLB (4 octlets) 

Virtual cache tags (2k-128k x 2 caches) 

Virtual cache data (16k-1M x 2 caches) 

Global TLB (4 octlets x 64-256 entries=2k-8k bytes)

Events and Threads

[...] the software or hardware, such as arithmetic overflow or parity error, (2) events
[...] corrective action [...] computational process, such as completion of [...]
transfer.

[...] handling [...] interruption of the current flow of [...]

[...] register state. [...] includes one program counter, stack space, and other
[...] a Mach thread exactly equals one [...] to be associated with several
[...] one task with one [...]

In the taxonomy of events described above, the cause of the event may either be
synchronous to the currently running thread, generally types 1, 2, and 3, or
asynchronous [...] not currently running, generally type 4. For synchronous
events, Terpsichore will suspend the [...] thread that is dedicated to the
handling of events. For asynchronous events, Terpsichore will continue execution
with the dedicated event thread while not necessarily suspending the currently
running thread.

Terpsichore provides sufficient resources for the interleaved execution of at least
[...] 64 registers and a program counter and [...] counter. When both threads are
able to continue execution, priority is generally given to the event threads.

[...] exception, memory management, and interface systems are [...] to provide
[...] manipulation of these facilities by high-level language, compiled code. In
particular, the thread resources of the full threads are memory-mapped so that the
exception threads [...] threads.

Events are single-bit messages used to communicate the occurrence of exceptions
between full threads and event threads and interface devices.



 63                          0
|           event             |
              64

The event register appears at several locations in memory, with slightly different
side effects on read and write operations.

offset | side effect on read                  | side effect on write
-------+--------------------------------------+--------------------------------------
0      | none: return event register contents | normal: write data into event
       |                                      | register
8      | stall thread until contents of event | stall thread until bitwise and of
       | register is non-zero, then return    | data and event register contents is
       | event register contents              | non-zero
16     | return zero value (so read-modify-   | one bits in data set (to one)
       | write for byte/doublet/quadlet store | corresponding event register bits
       | works)                               |
24     | return zero value (so read-modify-   | one bits in data clear (to zero)
       | write for byte/doublet/quadlet store | corresponding event register bits
       | works)                               |
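
The four aliases of the event register listed above can be modelled in software. The Python below is a sketch under stated assumptions, not a description of the hardware: the class name EventRegister is hypothetical, and a stall is represented by returning None from a read or False from a write instead of blocking.

```python
from typing import Optional

MASK64 = (1 << 64) - 1


class EventRegister:
    """Toy model of the event register and its four memory-mapped offsets."""

    def __init__(self) -> None:
        self.value = 0

    def read(self, offset: int) -> Optional[int]:
        if offset == 0:
            return self.value
        if offset == 8:
            # would stall until some event bit is set; None models the stall
            return self.value if self.value else None
        if offset in (16, 24):
            return 0  # lets a read-modify-write of a sub-octlet store work
        raise ValueError("unknown event register offset")

    def write(self, offset: int, data: int) -> bool:
        """Return True if the write completes, False if it would stall."""
        data &= MASK64
        if offset == 0:
            self.value = data
        elif offset == 8:
            return (self.value & data) != 0
        elif offset == 16:
            self.value |= data            # set the selected event bits
        elif offset == 24:
            self.value &= ~data & MASK64  # clear the selected event bits
        else:
            raise ValueError("unknown event register offset")
        return True
```

In this model a thread waiting on a particular event would retry the offset 8 write with a mask of the event bits of interest until it returns True.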



Interface devices signal events by responding to non-blocking read requests
generated by a write to a Terpsichore control register. The response to these read
requests is combined into the event register with an inclusive-or operation.

 63                          0
|    event daemon address     |
              64

A write to the event daemon address register causes Terpsichore to issue a read
request to the corresponding physical address. The device referenced by this
request may respond at any future time with a value, which is inclusive-or'ed into
the event register.



The following notes list the resources needed to support the threads... 

Events: 

 0      full thread 0 suspended at instruction fetch because of exception
 1      full thread 0 suspended at data fetch because of exception
 2      full thread 0 suspended at execution because of exception
 3      full thread 0 suspended at execution because of empty pipeline
 4-15   same for full threads 1-3
16-63   timer, calliope, and hydra events

Thread resources:

General registers at data fetch stage
General registers at execution stage
Program counter, privilege level at data fetch stage
Program counter, privilege level at execution stage
Mask register: events which permit the thread to run
Mask register: events which prevent the thread from running
[...]

Exception information registers: 
exception cause 

instruction which caused exception 
virtual address at which access attempted 
size of access attempted 
type of access (read, write, execute, gateway) 

Sort by stage:
Inst fetch stage:
    program counter
    suspend (drain queue),
    reset (clear queue),
    proceed past detail
Data fetch stage:
    General registers
    program counter, privilege level
    control register:
        suspend (drain queue),
        reset (clear queue)
        proceed past detail
    exception state: cause (incl. access type, size, boundary,
        local TLB hit indication), inst
    can compute local va from GR, inst:
        shift-and-add-load-shiftl-shiftr-add (7)
    [...] global va register
    [...] shift and add load (4)
    prefetched data, instruction queue
        clear queue
        drain queue
Execute stage:
    General registers
    program counter, privilege level
    control register: suspend
    exception state: cause (flt/fix arithmetic), inst
Exceptions:

number | exception
-------+----------------------------------------------
0      | Access disallowed by tag
1      | Access detail required by tag
2      | Cache coherence action required by tag
3      | Access disallowed by virtual address
4      | Access disallowed by global TLB
5      | Access detail required by global TLB
6      | Cache coherence action required by global TLB
7      | Global TLB miss
8      | Access disallowed by local TLB
9      | Access detail required by local TLB
10     | Cache coherence action required by local TLB
11     | Local TLB miss
12     | Floating-point arithmetic
13     | Fixed-point arithmetic
14     | Reserved instruction
15     |


Parameter passing 



There are no special registers to indicate details about the exception, such as the
virtual address at which an access was attempted, or the operands of a floating-
point operation that results in an exception. Instead, this information is available
via memory-mapped registers.

When a synchronous exception occurs in a full thread, the corresponding thread's
state is frozen, and a general event is signalled. An event thread should handle the
exception, in whatever manner is required, and then may restart the full thread
by writing to the full thread's control register.

When a synchronous exception occurs in an event thread, an immediate transfer
of control occurs to the machine check vector address, with information about the
exception available in the machine check cause field of the status register. The
transfer of control may overwrite state that may be necessary to recover from the
exception; the intent is to provide a satisfactory post-mortem indication of the
characteristics of the failure.

Exceptions in detail 

This section is under construction. Terpsichore has changed from passing the 
parameters in registers to passing the parameters in memory-mapped registers, 
and the information in this section doesn't reflect the changes yet. 

This section describes in detail the conditions under which exception occurs, the 
parameters passed to the exception handler, and the handling of the result of the 
procedure.

Access disallowed by tag

This exception occurs when a read (load), write (store), execute or gateway
attempts to access a virtual address for which the matching virtual cache entry
does not permit this access.

int AccessDisallowedByTag(int address, int size, int access)

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning execute, and 3 meaning gateway. The
exception handler should determine accessibility, modify the virtual memory state
if desired, and return if the access should be allowed. Upon return, execution is
restarted and the access will be retried.
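
The access type codes and the per-access minimum privilege fields can be combined into a small check. The Python below is a hedged sketch: it assumes that an access is allowed when the current privilege level (0 to 3, with 3 most privileged) is at least the "minimum privilege" recorded for that access type, which is one reading of the field descriptions and not a confirmed rule. The names AccessType and access_permitted are hypothetical.

```python
from enum import IntEnum
from typing import Mapping


class AccessType(IntEnum):
    """Access codes as passed in the `access` parameter of the handlers."""
    READ = 0
    WRITE = 1
    EXECUTE = 2
    GATEWAY = 3


def access_permitted(min_privilege: Mapping[AccessType, int],
                     access: AccessType, level: int) -> bool:
    """Assumed rule: allow when level >= the minimum for this access type."""
    if not 0 <= level <= 3:
        raise ValueError("privilege level must be in 0..3")
    return level >= min_privilege[access]
```

A handler such as AccessDisallowedByTag would be invoked in the case where a check of this kind fails against the matching virtual cache entry.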

Access detail required by tag

This exception occurs when a read (load), write (store), or execute attempts to
access a virtual address for which the matching virtual cache entry would permit
this access, but the detail bit is set.

Prototype

int AccessDetailRequiredByTag(int address, int size, int access)

Description

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning execute, and 3 meaning gateway. The
exception handler should determine accessibility and return if the access should
be allowed. Upon return, execution is restarted and the access will be retried. If
the detail bit is set in the matching virtual cache entry, access will be permitted.

Cache coherence action required by tag

This exception occurs when a read (load, execute, or gateway), write (store), or
replacement attempts to access a virtual address for which the coherence state of
the matching virtual cache entry cannot permit this access.

int CacheCoherenceInterventionRequired(int address, int size, int access)

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning replacement. The exception handler
should modify the cache status to make the cache line accessible. Upon return,
execution is restarted and the access will be retried.



[...] attempts to access a virtual address for which the matching global TLB entry
does not permit this access.

int AccessDisallowedByGlobalTLB(int address, int size, int access)

Description

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning execute, and 3 meaning gateway. The
exception handler should determine accessibility, modify the virtual memory state
if desired, and return if the access should be allowed. Upon return, execution is
restarted and the access will be retried. If [...] in the matching
global TLB entry, access will be permitted.

Access detail required by global TLB

This exception occurs when a read (load), write (store), execute or gateway
attempts to access a virtual address for which the matching global TLB entry
would permit this access, but the detail bit in the global TLB entry is set.

int AccessDetailRequiredByGlobalTLB(int address, int size, int access)

Description

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning execute, and 3 meaning gateway. The
exception handler should determine accessibility and return if the access should
be allowed. Upon return, execution is restarted and the access will be forced to be
permitted. If the access is not to be allowed, the handler should not return.

This exception occurs when a read (load, execute, or gateway), write [...]

int CacheCoherenceInterventionRequired(int address, int size, int access)

[...] return, execution is restarted and the access will be retried.

Global TLB miss

Prototype

void GlobalTLBMiss(int address, int size, int access)

Description

The address at which the global TLB miss occurred is passed as address. The size
[...]

Access disallowed by local TLB

[...] for which the matching local TLB entry does [...] permit this access.

int AccessDisallowedByLocalTLB(int address, int size, int access)

Description

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning execute, and 3 meaning gateway. The
exception handler should determine accessibility, modify the virtual memory state
if desired, and return if the access should be allowed. Upon return, execution is
restarted and the access will be retried.

Access ceiaH reouired ov ioca^ "^,5 

This exception occurs when a read »load). write (store), execute, or gateway 
attempts to access a virtual address for which the matchma local TLB encr^ would 
permit this access, but the detail bit in the local TLB entry'is set. 

ini AccessDetailRequiredByLocalTLBtinc address, int size, int access) 

Descr!ot:<:yn 

The address at which the access was attempted is passed as address. The size of 
the access in bytes is passed as size. The type of access is passed as access, with 0 
meaning read, I meaning write, 2 meaning execute, and 3 meaning gatewav. The 
exception handler should determine accessibihty and return if the access should 
be allowed. Upon return, execution is restarted and the access wiU be forced to be 
permitted. If the access is not to be aUowed. the handler should not return. 

Cache coherence action required by local TLB

This exception occurs when a read (load, execute, or gateway), write (store), or
replacement attempts to access a virtual address for which the coherence state of
the matching local TLB entry cannot permit this access.

Prototype

int CacheCoherenceInterventionRequired(int address, int size, int access)
Description

The address at which the access was attempted is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning replacement. The exception handler
should modify the virtual memory state to make the local TLB accessible. Upon
return, execution is restarted and the access will be retried.

Local TLB miss

This exception occurs when a read (load), write (store), execute, or gateway
attempts to access a virtual address for which no local TLB entry matches.

Prototype

void LocalTLBMiss(int address, int size, int access)
Description

The address at which the local TLB miss occurred is passed as address. The size of
the access in bytes is passed as size. The type of access is passed as access, with 0
meaning read, 1 meaning write, 2 meaning execute, and 3 meaning gateway.



Floating-point arithmetic



quad FloatingPointArithmetic(int inst, quad ra, quad rb, quad rc)

Description

The contents of the instruction which was the cause of the exception is passed as
inst, and the contents of registers ra, rb and rc are passed as ra, rb and rc. The
exception handler should attempt to perform the function specified in the
instruction and service any exceptional conditions that occur. The result of the
function is placed into register rc or rd upon return.

Fixed-point arithmetic

Prototype

int FixedPointArithmetic(int inst, int ra, int rb)

Description

The contents of the instruction which was the cause of the exception is passed as
inst, and the contents of registers ra and rb are passed as ra and rb. The exception
handler should attempt to perform the function specified in the instruction and
service any exceptional conditions that occur. The result of the function is placed
into register rb or rc.

Reserved instruction

int ReservedInstruction(int inst, int ra, int rb)

Description

The contents of the instruction which was the cause of the exception is passed as
inst, and the contents of registers ra and rb are passed as ra and rb. The result of
the function is placed into register rd.

Access disallowed by virtual address

This exception occurs when a load, store, branch, or gateway refers to an aligned
memory operand with an improperly aligned address.

Prototype

int AccessDisallowedByVirtualAddress(int inst, int address)

Description

The contents of the instruction which was the cause of the exception is passed as
inst, and the address at which the access was attempted is passed as address.

Clock 

Each Euterpe processor includes a clock that maintains processor-clock-cycle
accuracy. The value of the clock cycle register is incremented on every cycle,
regardless of the number of instructions executed on that cycle. The clock cycle
register is 64 bits long.

For testing purposes the clock cycle register is both readable and writable, though
in normal operation it should be written only at system initialization time; there is
no mechanism provided for adjusting the value in the clock cycle counter without
the possibility of losing cycles.

[Register layout: clock cycle, bits 63..0 (64 bits)]

Clock Event 

An event is asserted when the value in the clock cycle register is equal to the value 
in the clock match register, which sets the specified clock event bit in the event 
register. 

For testing purposes the clock match register is both readable and writable,
though in normal operation it is ordinarily only written.
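
The interaction between the clock cycle register, the clock match register, and the event register can be illustrated with a small software model. This is a sketch under stated assumptions: the `clock_model` structure and `clock_tick` function are hypothetical names, the comparison is modeled as happening after each increment, and the event bit index is assumed to be a 6-bit value selecting one of 64 event bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of the clock registers; it does not
 * describe a hardware interface. */
struct clock_model {
    uint64_t cycle;      /* clock cycle register (64 bits) */
    uint64_t match;      /* clock match register (64 bits) */
    uint64_t events;     /* event register */
    unsigned event_bit;  /* which event bit a clock match sets (0..63) */
};

/* Advance one processor cycle. The counter increments on every cycle,
 * regardless of how many instructions were executed in that cycle. */
static void clock_tick(struct clock_model *c)
{
    assert(c != 0);
    c->cycle++;  /* unsigned arithmetic wraps modulo 2^64 */
    if (c->cycle == c->match)
        c->events |= UINT64_C(1) << (c->event_bit & 63u);
}
```

In this model the event bit stays set once raised; clearing it is left to whatever mechanism services the event register.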

[Register layout: clock match, bits 63..0 (64 bits)]

[Register layout: clock event, bits 5..0 of a 64-bit register]
Watchdog Timer

A Machine Check is asserted when the value in the clock cycle register is equal to
the value in the watchdog timer register.

time K IS written. • register oetorc the next 




Tally Counter



e^ntslro^eraSr^^^^^^ ^^^^ processor-related 

each processor dock cvcle in whkh . "y>/°""«^ ^^e incremented on 

count'er register^dt n't^nVEtr;^^^^^^^^^^ °^ ""^ 

iount^r'rSsteS'irfl^^^^^^^^^ ^-P'^-^d so that the taUv 

sufficient for ^ 4 rwl T u xu'^ frequently than once per second. 32 bits is 
ZlL2^Lf:tonl^~ unimplemented bits must be zero 



[Register layout: tally counter 0, bits 63..0 (64 bits)]

[Register layout: tally counter 1, bits 63..0 (64 bits)]

^o"taliy"' "^'"^ °f 'he event counters 

[Register layout: bits 63..32 are 0; bits 31..16 tally control 0; bits 15..0 tally control 1]

The valid values for the tally control fields are given by the following table:

value         interpretation
0..63         tally events 0..63
64            freeze counter: count nothing
65            tally instructions processed by address unit
66            tally instructions processed by execute unit
67            tally instruction cache misses
68            tally data cache misses
69            tally data cache references
70..65535     Reserved

tally control field interpretation
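
A tally control value can be classified in software as sketched below. The sketch assumes the value column reads 0..63, 64, 65, 66, 67, 68, 69, 70..65535 in order (the value 69 is not legible in the source and is inferred from the sequence), and that tally control 0 and tally control 1 occupy bits 31..16 and 15..0 of the control register. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Categories of tally control values (names are illustrative only). */
enum tally_kind {
    TALLY_EVENT,        /* 0..63: tally events 0..63 */
    TALLY_FREEZE,       /* 64: count nothing */
    TALLY_ADDR_UNIT,    /* 65: instructions processed by address unit */
    TALLY_EXEC_UNIT,    /* 66: instructions processed by execute unit */
    TALLY_ICACHE_MISS,  /* 67: instruction cache misses */
    TALLY_DCACHE_MISS,  /* 68: data cache misses */
    TALLY_DCACHE_REF,   /* 69: data cache references (inferred value) */
    TALLY_RESERVED      /* 70..65535 */
};

static enum tally_kind tally_decode(uint16_t control)
{
    if (control <= 63)
        return TALLY_EVENT;
    switch (control) {
    case 64: return TALLY_FREEZE;
    case 65: return TALLY_ADDR_UNIT;
    case 66: return TALLY_EXEC_UNIT;
    case 67: return TALLY_ICACHE_MISS;
    case 68: return TALLY_DCACHE_MISS;
    case 69: return TALLY_DCACHE_REF;
    default: return TALLY_RESERVED;
    }
}

/* Field extraction, assuming control 0 in bits 31..16 and control 1
 * in bits 15..0. */
static uint16_t tally_control0(uint64_t reg) { return (uint16_t)(reg >> 16); }
static uint16_t tally_control1(uint64_t reg) { return (uint16_t)reg; }
```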

Control Register Addresses

This section is under construction. Software and hardware designers should not

physical address    description

event register
event register with stall
event register bit set
event register bit clear
event daemon address
full thread 0, general register 0, data fetch stage
full thread 0, general register 63, data fetch stage
full thread 0, program counter and privilege level, data fetch stage
full thread 0, control register, data fetch stage
full thread 0, status register, data fetch stage
full thread 0, general register 0, execution stage
full thread 0, general register 63, execution stage
full thread 0, program counter and privilege level, execution stage
full thread 0, control register, execution stage
full thread 0, status register, execution stage
event thread 0, general register 0, data fetch stage
event thread 0, general register 15, data fetch stage
event thread 0, program counter and privilege level, data fetch stage
event thread 0, control register, data fetch stage
event thread 0, status register, data fetch stage
event thread 0, general register 0, execution stage
event thread 0, general register 15, execution stage
event thread 0, program counter and privilege level, execution stage
event thread 0, control register, execution stage
event thread 0, status register, execution stage
local TLB entry 0
local TLB entry 1
local TLB entry 2
local TLB entry 3
clock cycle register
clock match register
clock event control register
tally counter 0
tally counter 1
tally control 0/1
Reset and Error Recovery

Certain external and internal events cause the Euterpe processor to invoke reset
or error recovery operations. These operations consist of a full or partial reset of
critical machine state, including initialization of the event thread to begin
execution at the start vector address. Software may determine the nature of
the reset or error by reading the value of the Cerberus control register, in which
finding the reset bit set (1) indicates that a reset has occurred, finding the clear bit
set (1) and the reset bit cleared (0) indicates that a logic clear has occurred, and
finding both the reset and clear bits cleared (0) indicates that a machine check has
occurred. When either a reset or machine check has been indicated, the contents
of the Cerberus status register contains more detailed information on the cause.
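
The decision procedure for the reset and clear bits can be sketched as follows. This is illustrative only: it assumes the reset and clear bits occupy bits 63 and 62 of the Cerberus control register, as the register table later in this document appears to indicate, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions in the Cerberus control register. */
#define CTRL_RESET_BIT (UINT64_C(1) << 63)
#define CTRL_CLEAR_BIT (UINT64_C(1) << 62)

enum recovery_cause {
    CAUSE_RESET,         /* reset bit set */
    CAUSE_LOGIC_CLEAR,   /* clear bit set, reset bit cleared */
    CAUSE_MACHINE_CHECK  /* both bits cleared */
};

/* Classify why the event thread was started at the start vector. */
static enum recovery_cause classify_cause(uint64_t control)
{
    if (control & CTRL_RESET_BIT)
        return CAUSE_RESET;
    if (control & CTRL_CLEAR_BIT)
        return CAUSE_LOGIC_CLEAR;
    return CAUSE_MACHINE_CHECK;
}
```

After a reset or machine check is identified this way, recovery code would go on to read the Cerberus status register for the detailed cause.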

Reset 

A reset mav be caused hv i Pprkon.. i ^-^ , 

addresses ^o etToh^^^^^ the local TLB to translate all locaAS 

s°ft!^rr'rhi/"evnr -t 'I- ''"^ "^"^ ^■''P''^«lv '""ialized bv 

sottuare. this explicitly includes the main thread state, global TLB state 

interface devices. The code at the start vector address is responsible for initializing
these remaining system facilities, and reading further bootstrap code from a series
of standard interface devices to be specified.

Power-on Reset

A reset occurs upon power-on. The cause of the reset is noted by initializing
the Cerberus status register and other registers to the reset values noted below.

A reset occurs upon observing that the Cerberus SD data signal has been at a
logic low level for at least 33 cycles of the Cerberus SC clock signal. The cause of
the reset is noted by initializing the Cerberus status register and other registers to
the reset values noted below.

Cerberus Control Register Reset

A reset occurs upon writing a one to the reset bit of the Cerberus control register.
The cause of the reset is noted by initializing the Cerberus status register and
other registers to the reset values noted below.

A reset occurs if the temperature is above the threshold set by the meltdown
margin field of the Cerberus configuration register. The cause of the reset is noted
by setting the meltdown detected bit of the Cerberus status register.

Double Machine Check Reset

A reset occurs if a second machine check occurs that prevents recovery from the
first machine check. Specifically, the occurrence of an exception in event thread,

Tausefis'Ttm sTin ""r ' "^^^ ^^''^'^ macSme check 

Che k re e Th cause of rk. ""' ^"^"f register results in a double machine 

bh of th: aTbe^STatul regis"? ^"""^ ''^'''^ ^-"^-^ ^'^^^ 

Clear 

T1i?c l\t^rl° ^^^^""^ r^g'"<=- a losic clear 

.nH I T'" Euterpe processor to set the configuration to the power 

cite 'rinr "^'"'"i ^"'^ '° 'l'^ Cerber-us power and su'ng 

configuration register and Hermes channel configuration registers,
stabilize the phase locked loops, set the local TLB to translate all local virtual
addresses to equal physical addresses, and initialize a single event thread to begin

dear '° " ^"'^ « end of the logic 

Machine Check

Detected hardware errors, such as communications errors in one of the Hermes
channels or the Cerberus bus, a watchdog timeout error, or internal cache parity
error, invoke a machine check. A machine check will set the local TLB to
translate all local virtual addresses to equal physical addresses, note the cause of
the exception in the Cerberus status register, and transfer control of the event
thread to the start vector address. This action is similar to that of a reset, but
differs in that the configuration settings, main thread state, and Cerberus and
Mnemosyne state are preserved.

Recovery from machine checks depends on the severity of the error and the
potential loss of information as a direct cause of the error. The start vector address
is designed to reach instruction memory accessed via Cerberus, so that operation
of machine check diagnostic and recovery code need not depend on proper
operation or contents of any Hermes channel device. The program counter and
register file state of the event thread prior to the machine check is lost (except for
the portion of the program counter saved in the Cerberus status register), so
diagnostic and recovery code must not assume that the register file state is
the prior operating state of the event thread. The state of the main
thread is frozen similarly to that of a main thread exception.

Machine check diagnostic code determines the cause of the machine check from
the processor's Cerberus status register, and as required, the Cerberus status and
other registers of devices connected to the ByteChannels. Any outstanding
memory transactions may be recovered by a combination of software to re-issue
outstanding writes, and by aborting and restarting the main thread execution
pipeline to purge outstanding reads.

Because Cerberus operates much more slowly than the peak speed of the Euterpe
processor under normal operation, machine check diagnostic and recovery code
will generally consume enough time that real-time interface performance targets
may have been missed. Consequently, the machine check recovery software may
need to repair further damage, such as interface buffer underruns and overruns
as may have occurred during the intervening time.

This final recovery code, which re-initializes the state of the interface system and
recovers a functional event thread state, may return to using the complete
machine resources, as the condition which caused the machine check will have
been resolved.



The following table lists the causes of machine check errors.

Parity or uncorrectable error in Euterpe cache
Parity or uncorrectable error in Mnemosyne cache
Parity or uncorrectable error in Calliope memory
Parity or uncorrectable error in system-level memory
Communications error in Hermes channels
Communications error in Cerberus bus
Event Thread exception
Watchdog timer

machine check errors
Parity or Uncorrectable Error in Cache

When a parity or uncorrectable error occurs in a Euterpe or Mnemosyne cache,
such an error is generally non-recoverable. These errors are non-recoverable
because the data in such caches may reside anywhere in memory, and because
the data in such caches may be the only up-to-date copy of that memory contents.
Consequently, the entire contents of the memory store is lost, and the severity of
the error is high enough to consider such a condition to be a system failure.

The machine check provides an opportunity to report such an error before
shutting down a system for repairs.

There are specific means by which a system may recover from such an error
without failure, such as by restarting from a system-level checkpoint, from which a
consistent memory state can be recovered.
When a parity or uncorrectable error occurs in Mnemosyne or Calliope memory,
such an error may be partially recoverable. The contents of the affected
memory is lost, and consequently the tasks associated with that memory must
generally be aborted, or resumed from a task-level checkpoint. If the contents of
the affected memory can be recovered from mass storage, a complete recovery is
possible.

If the affected memory is that of a critical part of the operating system, such a
condition is considered a system failure, unless recovery can be accomplished
from a system-level checkpoint.

Bits corresponding to the atferr^rl R-rt^o,. u i 

Cerberus status register Vcoverv .5 " u processor's 

affected, bv quer'S Se Cerberu. T determine which devices are 

Hermes channels °^ "'^^ '^^"<^^ °" ^'''^ affected 

ct1k"l::i.?e'X"3nT^ch':ck f"" ""'r^" °f ^ 

transactions ^vill not brcomnUted 1/.?."" f^' """" channel(s), these 
memory interface buffers foT uncotnl.^^^^^ '^"^^ '^""S^ 

stores. Chen must -etXt^^^^^^^^^^^ ^hem as 

Communications Error in Cerberus Bus

A communications error in the Cerberus bus snrh « o r» k 

error is generallv fuUv recoverable \ C^L Cerberus transaction 

mav resSlt from normd svsret r ^"or (due to timeout) 
existence of o"S,nardTvl« bThel!;^^^^^^^^^ 

Watchdog Timer Error






Start Vector Address



The start vector address is used to initialize the event thread with a program
counter upon a reset, clear or machine check. These causes of such initialization
can be differentiated by the contents of the Cerberus status register.

The start vector address is a virtual address which, when "translated" by the local
TLB to a physical address, is designed to be accessible through Cerberus node
number zero on the local Cerberus bus, which will ordinarily contain an interface
to the bootstrap ROM code. The Cerberus/Bootstrap ROM space is chosen to
minimize the number of internal Terpsichore resources and Terpsichore interfaces
that must be operated to begin execution or recover from a machine check.



virtual address          description

0x0003 0000 0000 0000





Bootstrap Code

Bootstrap code requirements are a necessary part of the Terpsichore System
Architecture, but remain to be specified in a later version of this document.



The basic requirements of Terpsichore bootstrap code include power-on
initialization of Euterpe, Calliope, and Mnemosyne devices using Cerberus,
handling of machine checks, and selection of an interface from which
further bootstrap code is obtained. Interfaces are scanned in a priority-
based ordering which gives highest priority to removable/replaceable read-only
storage devices, then removable/replaceable read-write devices, then network
interfaces, then non-removable storage devices.
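
The scan order described above can be sketched as a simple priority selection. This is a minimal illustration, not a specified bootstrap interface: the `boot_class` enumeration, the `boot_candidate` structure, and `pick_boot_device` are hypothetical names, and device presence is assumed to be known in advance.

```c
#include <assert.h>
#include <stddef.h>

/* Device classes in descending priority; a lower value means a
 * higher priority in the scan. */
enum boot_class {
    BOOT_REMOVABLE_RO  = 0,  /* removable/replaceable read-only storage */
    BOOT_REMOVABLE_RW  = 1,  /* removable/replaceable read-write devices */
    BOOT_NETWORK       = 2,  /* network interfaces */
    BOOT_FIXED_STORAGE = 3   /* non-removable storage devices */
};

/* Hypothetical description of one candidate interface. */
struct boot_candidate {
    const char     *name;
    enum boot_class cls;
    int             present;  /* nonzero if the device responded */
};

/* Return the present candidate with the highest priority, or NULL
 * if no candidate is present. Ties keep the earliest entry. */
static const struct boot_candidate *
pick_boot_device(const struct boot_candidate *c, size_t n)
{
    const struct boot_candidate *best = NULL;
    size_t i;

    assert(c != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!c[i].present)
            continue;
        if (best == NULL || c[i].cls < best->cls)
            best = &c[i];
    }
    return best;
}
```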

Cerberus Registers

Cerberus registers are internal read/only and read/write registers which provide
an implementation-independent mechanism to query and control the
configuration of devices in a Terpsichore system. By the use of these registers, a
user of a Terpsichore system may tailor the use of the facilities in a general-
purpose implementation for maximum performance and utility. Conversely, a
supplier of a Terpsichore system component may modify facilities in the device
without compromising compatibility with earlier implementations. These registers
are accessed via the Cerberus serial bus.

Within a Terpsichore system, each Euterpe processor contains
a set of Cerberus-accessible configuration registers. Additional sets of
configuration registers are present for each additional device in a Euterpe system,
including Mnemosyne memory devices and Calliope interface devices.

Read/only registers supply information about the Terpsichore system
implementation in a standard, implementation-independent fashion. Terpsichore
software may take advantage of this information, either to verify that a compatible
implementation of Mnemosyne is installed, or to tailor the use of the part to
conform to the characteristics of the implementation.

The read/only registers occupy addresses 0..5. An attempt to write these registers
may cause a normal or an error response.

Read/write registers select operating modes and select power and voltage levels
for gates and signals. The read/write registers occupy addresses 6..9 and 25..43.

Reserved registers in the range 10..24 and 44..63 must appear to be read/only
registers with a zero value. An attempt to write these registers may cause a normal
or an error response.

Reserved registers in the range 64..2^16-1 may be implemented either as read/only
registers with a zero value, or as addresses which cause an error response if reads
or writes are attempted.
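
The address ranges above can be summarized in a small classifier. This sketch rests on a reading of partly illegible ranges in the source (read/write registers at 6..9 and 25..43, zero-valued reserved registers at 10..24 and 44..63, and optional reserved registers from 64 up to 2^16-1); the identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum reg_class {
    REG_READ_ONLY,         /* 0..5 */
    REG_READ_WRITE,        /* 6..9 and 25..43 */
    REG_RESERVED_ZERO,     /* 10..24 and 44..63: must read as zero */
    REG_RESERVED_OPTIONAL  /* 64..65535: zero or error response */
};

/* Classify a Cerberus register address (octlet number). */
static enum reg_class cerberus_reg_class(uint16_t addr)
{
    if (addr <= 5)
        return REG_READ_ONLY;
    if (addr <= 9 || (addr >= 25 && addr <= 43))
        return REG_READ_WRITE;
    if (addr <= 63)
        return REG_RESERVED_ZERO;
    return REG_RESERVED_OPTIONAL;
}
```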

The format of the registers is described in the table below. The octlet is the
Cerberus address of the register; bits indicate the position of the field in a register.
The value indicated is the hard-wired value in the register for a read/only register,
and is the value to which the register is initialized upon a reset for a read/write
register. If a reset does not initialize the field to a value, or if initialization is not
required by this specification, a * is placed in or appended to the value field. The
range is the set of legal values to which a read/write register may be set. The
interpretation is a brief description of the meaning or utility of the register field; a
more comprehensive description follows this table.
octlet  bits    field name             value                range  interpretation

0       63..16  architecture code      0x00 40 a3 24 69 93         Identifies processor device as compliant with
                                                                   MicroUnity Euterpe processor architecture.
0       15..0   architecture revision  0x01 00                     Device complies with architecture version 1.0.

1       63..16  implementor code       0x00 40 as d2 b6 7f         Identifies Euterpe processor device as
                                                                   implemented by MicroUnity.
1       15..0   implementor revision   0x01 00                     Implementation version 1.0.

2       63..16  manufacturer code      0x00 40 69 db 3f            Identifies initial manufacturer of Euterpe
                                                                   processor device implemented by MicroUnity.
2       15..0   manufacturer revision  0x01 00                     Manufacturing version 1.0.

3       63..16  serial number          0                           This device has no serial number capability.
3       15..0   dynamic address        0                           This device has no dynamic addressing capability.

4       63..60  A                      4                    0..15  Size of a Hermes address
4       59..56  log2 W                 3                    0..15  Size of a Hermes word
4       55..54  0                      0                           reserved
4       53      I                      0                    0..1   set if support for Icarus
4       52      i                      1                    0..1   log2 Hermes words per interleave block
4       51..48  H                      2                    0..15  number of Hermes channels
4       47                             0                           reserved
4       46      Ce                     0                    0..1   set if instruction SRAM can be all cache
                                                                   (enough tag storage)
4       45                             1                    0..1   set if instruction SRAM can be all buffer
4       44..40  c                      9                    0..31  log2 cache blocks in instruction SRAM
                                                                   (buffer+cache)
4       39                             0                           reserved
4       38                             0                    0..1   set if data SRAM can be all cache
                                                                   (enough tag storage)
4       37      Db                     1                    0..1   set if data SRAM can be all buffer
4       36..32  D                      9                    0..31  log2 cache blocks in data SRAM (buffer+cache)
4       31..30                         0                           Reserved
4       29..28  L                      0                    0..3   log2 entries in local TLB (per thread)
4       27..24  G                                           0..15  log2 entries in global TLB
4       23..21                         0                           reserved
4       20..16  T                                           1..31  number of execution threads
4       15..0                          0                           Reserved for definition in later revision of
                                                                   Euterpe architecture

5       63..0                          0                           Reserved for definition in later revision of
                                                                   Euterpe architecture


octlet  bits    field name               value  range        interpretation

6       63      reset                    1      0..1         set to invoke device's circuit reset
6       62      clear                    1      0..1         set to invoke device's logic clear
6       61      selftest                 0      0..1         set to invoke device's selftest; bits 60..48 may
                                                             indicate depth of selftest
6       60      defer writes             0*     0..1         set to cause writes to octlets 25..43 to be deferred
                                                             until the next logic clear or non-deferred write.
6       59..48  0                        0      0            Reserved
6       47..44  Hermes channel           0      0            Reserved for additional Hermes channel disable bits.
                expansion
6       43..32  Hermes channel disable   4095   0..4095      For each Hermes channel, set to cause input channel
                                                             to be ignored and idles to be generated. Upon
                                                             clearing the bit, the input channel phase adjustment
                                                             is reset, and after a suitable delay, the input and
                                                             Hermes output channel links are available for use.
6       31..20  0                        0      0            Reserved
6       19..16  channel under test       0*     0..11        Channel on which cidle 0 and cidle 1 are transmitted
                                                             in place of normal idle pattern (0, 255), and from
                                                             which raw input bytes are sampled.
6       15..8   cidle 0                  0*     0..255       Value transmitted on idle Hermes output channel when
                                                             output clock zero (0).
6       7..0    cidle 1                  255*   0..255       Value transmitted on idle Hermes output channel when
                                                             output clock one (1).

7       63      reset/clear/selftest     1      0..1         This bit is set when a reset, clear or selftest
                complete                                     operation has been completed.
7       62      reset/clear/selftest     1      0..1         This bit is set when a reset, clear or selftest
                status                                       operation has been completed successfully.
7       61      meltdown detected        0      0..1         This bit is set when the meltdown detector has
                                                             caused a reset.
7       60      double machine check     0      0..1         This bit is set when a double machine check has
                                                             caused a reset.
7       59      other reset              0      0..1         This bit is reserved for indicating additional
                                                             causes of reset.
7       58      exception in event       0      0..1         This bit is set when an exception in event thread
                thread                                       has caused a machine check.
7       57      watchdog timeout error   0      0..1         This bit is set when a watchdog timeout has caused
                                                             a machine check.
7       56      Cerberus transaction     0      0..1         This bit is set when a Cerberus transaction error
                error                                        has caused a machine check.
7       55      Hermes channel check     0      0..1         This bit is set when a Hermes channel check byte
                byte error                                   error has caused a machine check.
7       54      Hermes channel command   0      0..1         This bit is set when a Hermes channel command error
                error                                        has caused a machine check.
7       53      Hermes channel timeout   0      0..1         This bit is set when a Hermes channel timeout has
                error                                        caused a machine check.
7       52..48  0                        0*     0            Reserved for other machine check causes.
7       47..32  check detail             0      0..4095      Set to indicate exception code if Exception in event
                                                             thread. Set to bitmap of which Hermes channels if
                                                             Hermes channel error.
7       31..16  machine check program    0      0            Set to indicate bits 31..16 of the value of the event
                counter                                      thread program counter at the initiation of a
                                                             machine check.
7       15..8   raw 0                           0..255       Value sampled on specified Hermes channel when input
                                                             clock is zero (0).
7       7..0    raw 1                           0..255       Value sampled on specified Hermes channel immediately
                                                             following sample value in raw 0 register.

8       63..0   indirect address         0*     0..2^64-1    Write to this register to set physical address used
                                                             for reads and writes to indirect data register.

9       63..0   indirect data                   0..2^64-1    Read and write to this register to reach physical
                                                             addresses not otherwise accessible via Cerberus.

10..24  63..0   0                        0      0            Reserved for expansion of Cerberus registers upward
                                                             or knobcity registers downward.
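
Machine check diagnostic code might pull the cause bits and detail fields out of the status register along the following lines. This is a sketch under assumptions: several bit numbers in the source table are illegible, so the positions used here (Hermes channel errors at bits 55..53, check detail at bits 47..32, machine check program counter at bits 31..16) follow from the order of the rows, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Single-bit mask for bit n (0..63) of a 64-bit status register. */
static uint64_t status_bit(unsigned n)
{
    assert(n < 64);
    return UINT64_C(1) << n;
}

/* Nonzero if any Hermes channel error (check byte, command, or
 * timeout) is flagged; assumed to be bits 55, 54 and 53. */
static int status_is_hermes_error(uint64_t status)
{
    uint64_t mask = status_bit(55) | status_bit(54) | status_bit(53);
    return (status & mask) != 0;
}

/* Check detail field, assumed at bits 47..32: an exception code or a
 * bitmap of affected Hermes channels, depending on the cause. */
static uint16_t status_check_detail(uint64_t status)
{
    return (uint16_t)(status >> 32);
}

/* Bits 31..16 of the event thread program counter saved at the
 * initiation of the machine check. */
static uint16_t status_mc_pc_bits(uint64_t status)
{
    return (uint16_t)(status >> 16);
}
```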
octlet  bits    field name                 value  range   interpretation

25      63..56  Unassigned Custom knob     121    1..127  Knob settings for Unassigned custom circuits.
25      55..48  Unassigned Custom knob     121    1..127  Knob settings for Unassigned custom circuits.
25      47..40  CI Tag knob                121    1..127  Knob settings for CI Tag circuits.
25      39..32  CD Tag knob                121    1..127  Knob settings for CD Tag circuits.
25      31..24  TLB knob                          1..127  Knob settings for TLB circuits.
25      23..16  Branch Target Cache knob   121    1..127  Knob settings for Branch Target Cache circuits.
25      15..8   I Cache knob               121    1..127  Knob settings for Instruction Cache circuits.
25      7..0    Eastside Repeater knob     121    1..127  Knob settings for Eastside Repeater circuits.

26      63..56  spar 1,2 knob              121    1..127  Knob settings for SOFA region spar 1,2.
26      55..48  spar 1,2 knob              121    1..127  Knob settings for SOFA region spar 1,2.
26      47..40  spar 1,2 knob              121    1..127  Knob settings for SOFA region spar 1,2.
26      39..32  spar 1,2 knob              121    1..127  Knob settings for SOFA region spar 1,2.
26      31..24  spar 0 knob                121    1..127  Knob settings for SOFA region spar 0.
26      23..16  spar 0 knob                121    1..127  Knob settings for SOFA region spar 0.
26      15..8   spar 0 knob                121    1..127  Knob settings for SOFA region spar 0.
26      7..0    spar 0 knob                121    1..127  Knob settings for SOFA region spar 0.

27      63..56  spar 5,6 knob              121    1..127  Knob settings for SOFA region spar 5,6.
27      55..48  spar 5,6 knob              121    1..127  Knob settings for SOFA region spar 5,6.
27      47..40  spar 5,6 knob              121    1..127  Knob settings for SOFA region spar 5,6.
27      39..32  spar 5,6 knob              121    1..127  Knob settings for SOFA region spar 5,6.
27      31..24  spar 3,4 knob              121    1..127  Knob settings for SOFA region spar 3,4.
27      23..16  spar 3,4 knob              121    1..127  Knob settings for SOFA region spar 3,4.
27      15..8   spar 3,4 knob              121    1..127  Knob settings for SOFA region spar 3,4.
27      7..0    spar 3,4 knob              121    1..127  Knob settings for SOFA region spar 3,4.

28      63..56  spar 9,10 knob             121    1..127  Knob settings for SOFA region spar 9,10.
28      55..48  spar 9,10 knob             121    1..127  Knob settings for SOFA region spar 9,10.
28      47..40  spar 7,8 knob              121    1..127  Knob settings for SOFA region spar 7,8.
28      39..32  spar 7,8 knob              121    1..127  Knob settings for SOFA region spar 7,8.
28      31..24  spar 7,8 knob              121    1..127  Knob settings for SOFA region spar 7,8.
28      23..16  spar 7,8 knob              121    1..127  Knob settings for SOFA region spar 7,8.
28      15..8   Clocks knob                121    1..127  Knob settings for clock circuits.
28      7..0    PLL knob                   85     1..127  Knob settings for PLL circuits.

29      63..56  spar 13,14 knob            121    1..127  Knob settings for SOFA region spar 13,14.
29      55..48  spar 13,14 knob            121    1..127  Knob settings for SOFA region spar 13,14.
29      47..40  spar 11,12 knob            121    1..127  Knob settings for SOFA region spar 11,12.
29      39..32  spar 11,12 knob            121    1..127  Knob settings for SOFA region spar 11,12.
29      31..24  spar 11,12 knob            121    1..127  Knob settings for SOFA region spar 11,12.
29      23..16  spar 11,12 knob            121    1..127  Knob settings for SOFA region spar 11,12.
29      15..8   spar 9,10 knob             121    1..127  Knob settings for SOFA region spar 9,10.
29      7..0    spar 9,10 knob             121    1..127  Knob settings for SOFA region spar 9,10.

30      63..56  Hermes channel knob        121    1..127  Knob settings for Hermes channel circuits.
30      55..48  Westside Repeaters knob    121    1..127  Knob settings for Westside Repeater circuits.
30      47..40  D Cache knob               121    1..127  Knob settings for Data Cache circuits.
30      39..32  Spring knob                121    1..127  Knob settings for Spring circuits.
30      31..24  Unassigned Custom knob     121    1..127  Knob settings for Unassigned custom circuits.
30      23..16  Unassigned Custom knob     121    1..127  Knob settings for Unassigned custom circuits.
30      15..8   spar 13,14 knob            121    1..127  Knob settings for SOFA region spar 13,14.
30      7..0    spar 13,14 knob            121    1..127  Knob settings for SOFA region spar 13,14.
octlet  bits    field name               value  range    interpretation

31      63      0                        0      0        Reserved
31      62..58  resistor fine tuning     20*    0..31    Set to fine-tune resistor termination value.
31      57..56  swing fine tuning               0..3     Set to fine-tune voltage swing and reference level
                                                         knob settings.
31      55      0                        0      0        Reserved
31      54..52  process control          5      4..6     Set based on value read from PMOS drive strength,
                                                         used to fine-tune resistor values in knob settings.
31      51      0                        0      0        Reserved
31      50..48  PMOS drive strength             0..7     This read/only field indicates the drive strength of
                                                         PMOS devices expressed as a digital binary value.
31      47..43  PLL1 divide ratio        8*     8..23    PLL1 divider ratio.
31      42      PLL1 feedback bypass            0..1     Set to invoke PLL1 feedback bypass.
31      41      PLL1 range               0*     0..1     Set for operation at high frequency (above
                                                         0.xxx GHz); cleared for operation at low frequency
                                                         (below 0.yyy GHz).
31      40      PLL prescaler bypass     0      0..1     Set to invoke PLL0 and PLL1 prescaler bypass,
                                                         otherwise divide input clock by 10.
31      39..35  PLL0 divide ratio        8      8..23    PLL0 divider ratio.
31      34      PLL0 feedback bypass     1      0..1     Set to invoke PLL0 feedback bypass.
31      33      PLL0 range               0      0..1     Set for operation at high frequency (above
                                                         0.xxx GHz); cleared for operation at low frequency
                                                         (below 0.yyy GHz).
31      32      conversion prescaler     0      0..1     Set to invoke temperature conversion prescaler
                bypass                                   bypass, otherwise divide input clock by 10.
31      31..24  analog measurement       0      0..255   Set to measure analog levels at various test points
                                                         within device.
31      23..22  meltdown threshold       0      0..3     Set to perform margin testing of the meltdown
                                                         detector.
31      21      conversion start         0*     0..1     Setting this bit causes the conversion to begin. The
                                                         bit remains set until conversion is complete.
31      20      0                        0      0        Reserved (selection extension)
31      19..16  conversion selection                     Field selects which of ten
31      15..10  0                        0      0        Reserved (counter extension)
31      9..0    conversion counter       0*     0..1023  This field is set to the two's complement of the
                                                         downslope count. The counter counts upward to zero,
                                                         and then continues counting on the upslope until
                                                         conversion completes.


octlet | bits | field name | default | range | interpretation
32..43 | 63 | 0 | 0 | 0 | Reserved
32..43 | 62 | quadrature bypass | 0 | 0..1 | Setting this bit causes the quadrature circuit to be bypassed: the input clock signal is used directly.
32..43 | 61 | quadrature range | 0 | 0..1 | Set to 0 if the Hermes channel is operating at a low frequency; 1 if the Hermes channel is operating at a high frequency.
32..43 | 60 | output termination | 1 | 0..1 | Set to enable output terminators. Cleared to disable output terminators.
32..43 | 59..57 | termination resistance | 1 | 0..7 | Set termination resistance level.
32..43 | 56..54 | output current | 1 | 0..7 | Set output current level.
32..43 | 53..48 | skew bit 7 | 1 | 0..63 | Set delay in Ho7 skew circuit.
32..43 | 47..42 | skew bit 6 | 1 | 0..63 | Set delay in Ho6 skew circuit.
32..43 | 41..36 | skew bit 5 | 1 | 0..63 | Set delay in Ho5 skew circuit.
32..43 | 35..30 | skew bit 4 | 1 | 0..63 | Set delay in Ho4 skew circuit.
32..43 | 29..24 | skew bit 3 | 1 | 0..63 | Set delay in Ho3 skew circuit.
32..43 | 23..18 | skew bit 2 | 1 | 0..63 | Set delay in Ho2 skew circuit.
32..43 | 17..12 | skew bit 1 | 1 | 0..63 | Set delay in Ho1 skew circuit.
32..43 | 11..6 | skew bit 0 | 1 | 0..63 | Set delay in Ho0 skew circuit.
32..43 | 5..0 | skew clk | 1 | 0..63 | Set delay in HoC skew circuit.

octlet | bits | field name | default | range | interpretation
44..63 | 63..0 | 0 | 0 | 0 | Reserved for use with additional Hermes channel interfaces.

octlet | bits | field name | default | range | interpretation
64..65536 | 63..0 | 0 | 0 | 0 | Reserved for use with later revisions of the architecture.

configuration memory space

MicroUnity's company identifier is: 0000 0000 0000 0010 1100 0101.
MicroUnity's architecture code for Euterpe is specified by the following table:

Internal code name | Code number
Euterpe | 0x00 40 a3 24 69 93



Euterpe architecture revisions are specified by the following table:

Internal code name | Code number
 | 0x01 00



MicroUnity's Euterpe implementor codes are specified by the following table:

Internal code name | Code number
MicroUnity | 0x00 40 a3 62 b6 7f

Internal code name | Revision number
1.0 | 0x01 00

Internal code name | Code number
MicroUnity | 0x00 40 a3 69 db 3f

MicroUnity's Euterpe, as implemented by MicroUnity and manufactured by
MicroUnity, uses manufacturer revisions as specified by the following table:

Internal code name | Code number
1.0 | 0x01 00







Architecture Description Registers

The architecture description registers in octlets 4 and 5 comply with the Cerberus
specification and contain a representation of the architecture
parameters A and W described in this document.

These registers are still under construction and will contain non-zero values in a
later revision of this document.

Parameters will describe the number of Hermes ports, the size of internal caches,
and the integration of Calliope and Mnemosyne functions.

The control register is a 64-bit register with both read and write access. It is
altered only by Cerberus accesses: Euterpe does not alter the values written to
this register.

The reset bit of the control register complies with the Cerberus specification and
provides the ability to reset an individual Euterpe device in a Terpsichore system.
Writing a one (1) to this bit is equivalent to a power-on reset or a broadcast
Cerberus reset (low level on SD for 33 cycles) and resets configuration registers to
their power-on values, which is an operating state that consumes minimal current.
The reset also causes all internal high-bandwidth logic to be reset. The duration of
the reset is sufficient for any operating state changes to have taken effect.

The clear bit of the control register complies with the Cerberus specification and
provides the ability to clear the logic of an individual Euterpe device in a system.
Writing a one (1) to this bit causes all internal high-bandwidth logic to be reset, as
is required after reconfiguring power and swing levels. The duration of the reset is
sufficient for any operating state changes to have taken effect. At the completion
of the reset operation, the reset/clear/selftest complete bit of the status register is
set, the reset/clear/selftest status bit of the status register is set, and the clear bit
of the control register is set.

The selftest bit of the control register complies with the Cerberus specification
and provides the ability to invoke a selftest on an individual Euterpe device in a
system. Euterpe does not define a selftest mechanism at this time, so
setting this bit will immediately set the reset/clear/selftest complete bit and the
reset/clear/selftest status bit of the status register.

The channel under test field of the control register provides a mechanism to test
and adjust skews on a single Hermes channel at a time. The field is set to the
channel number for which the cidle 0, cidle 1, raw 0, and raw 1 fields are active.

The cidle 0 and cidle 1 fields of the control register provide a mechanism to
repeatedly send simple patterns on the selected Hermes output channel for
purposes of testing and skew adjustment. For normal operation, the cidle 0 field
must be set to zero (0), and the cidle 1 field must be set to all ones (255).

Status Register

The status register is a 64-bit register with both read and write access, though the
only legal value which may be written is a zero, to clear the register. The result of
writing a non-zero value is not specified.

The reset/clear/selftest complete bit of the status register complies with the
Cerberus specification and is set upon the completion of a reset, clear or selftest
operation as described above.

The reset/clear/selftest status bit of the status register complies with the Cerberus
specification and is set upon the successful completion of a reset, clear or selftest
operation as described above.

The meltdown detected bit of the status register is set when the meltdown
detector has discovered an on-chip temperature above the threshold set by the
meltdown threshold field of the Cerberus configuration register, which causes a
reset to occur.



The double machine check bit of the status register is set when a second machine
check occurs that prevents recovery from the first machine check, or which is
indicative of machine check recovery software failure. Specifically, the occurrence
of an exception in event thread, watchdog timer error, or Cerberus transaction
error while any machine check cause bit of the status register is still set, or any
Hermes error while the exception in event thread bit of the status register is set in
the Cerberus status register, results in a double machine check reset.

The other reset cause bit of the status register is reserved for the indication of
other causes of reset.

The exception in event thread bit of the status register is set when an event thread
suffers an exception, which causes a machine check. The exception code is
loaded into the machine check detail field of the status register.

The watchdog timeout error bit of the status register is set when the watchdog 
timer register is equal to the clock cycle register, causing a machine check. 

The Cerberus transaction error bit of the status register is set when a Cerberus 
transaction error (bus timeout, invalid transaction code, invalid address) has 
caused a machine check. Note that Cerberus aborts, including locally detected 
parity errors, should cause bus retries, not a machine check. 

The Hermes check byte error bit of the status register is set when a Hermes
check byte error has caused a machine check. The bit corresponding to the
Hermes channel number which has suffered the error is set in the machine check
detail field of the status register.

The Hermes command error bit of the status register is set when a Hermes
command error has caused a machine check. The bit corresponding to the
Hermes channel number which has suffered the error is set in the machine check
detail field of the status register.

The Hermes timeout error bit of the status register is set when a ByteChannel
timeout error has caused a machine check. The bit corresponding to the Hermes
channel number which has suffered the error is set in the machine check detail
field of the status register.

The machine check detail field of the status register is set when a machine check
has been completed. For a Hermes channel error (check byte, command or
timeout), the value indicates the set of ByteChannels for which machine
checks have been reported. For an exception in event thread, the value indicates
the type of exception for which the most recent machine check has been reported.

The machine check program counter field of the status register is loaded with
program counter bits when a machine check occurs. This field provides a limited
diagnostic capability for purposes of software development, or possibly for error
recovery.

The raw 0 and raw 1 fields of the status register contain values obtained from
two adjacent samples of the specified Hermes input channel. The raw 0 field
contains a value obtained when the input clock was zero (0), and the raw 1 field
contains a value obtained when the input clock was one (1). Euterpe must ensure
that reading the status register produces two
Srberus T7i?f.S3"'""' °i f °^ """^ read operation on 

SrmT;h?n::i interfaTes"^' '"^"^ °* ""'"^ ^"'^ ^^'^-^ 

Power and Swing Calibration Registers

Euterpe uses a set of configuration registers to control the power and voltage
levels used for internal high-bandwidth logic and memory. The details of
programming these registers are described below.

Eight-bit fields separately control the power and voltage levels used in a portion of
the Euterpe circuitry. Each such field contains configuration data in the following
format:

  7   6   5   4   3   2       0
| 0 |  ref  |  lvl  |    res    |
  1     2       2         3

power and swing controls

The range of valid values and the interpretation of the fields is given by the
following table:

field | value | interpretation
0 | 0 | Reserved
ref | 0..3 | Set reference voltage level.
lvl | 0..3 | Set voltage swing level.
res | 0..7 | Set resistor load value.



The reference voltage level, voltage swing level and resistor load value are model
figures for a full-swing, lowest-power logic gate output. The actual voltage levels
and resistor load values used in various circuits are geometrically related to the
values in the tables below. Designed typical, full-speed settings for the ref, lvl and
res fields are ref=250 millivolts, lvl=500 millivolts, and res=2.5 kilohms.

The ref field, together with the swing fine tuning field of the configuration register,
controls the reference voltage level used for logic circuits in the specified knob
domain. Values and interpretations of the ref field are given by the following table,
with units in millivolts:

        swing fine tuning
ref     0     1     2     3
0     138   150   163   175
1     188   200   213   225
2     238   250   263   275
3     288   300   325   350



The lvl field, together with the swing fine tuning field of the configuration register,
controls the voltage swing level used for logic circuits in the specified knob domain.
Values and interpretations of the lvl field are given by the following table, with
units in millivolts:

        swing fine tuning
lvl     0     1     2     3
0     275   300   325   350
1     375   400   425   450
2     475   500   525   550
3     575   600   650   700

Voltage swing level control field interpretation

The res field, together with the process control field of the configuration register,
controls the PMOS load resistance value used for logic circuits in the specified
knob domain. Values and interpretations of the res field are given by the following
table, with units in kilohms. The table below gives resistance values with nominal
process parameters.





                     process control
res     0     1     2     3     4     5     6     7
0                       undefined
1           2.5   5.0   7.5   10.   13.   15.   18.
2           1.3   2.5   3.8   5.0   6.3   7.5   8.8
3           .83   1.7   2.5   3.3   4.2   5     5.8
4           .63   1.3   1.9   2.5   3.1   3.8   4.4
5           .50   1.0   1.5   2.0   2.5   3     3.5
6           .42   .83   1.3   1.7   2.1   2.5   2.9
7           .36   .71   1.1   1.4   1.8   2.1   2.5

Resistor control field interpretation



When the process control field of the configuration register is set equal to the
PMOS drive strength field of the configuration register, nominal PMOS load
resistance values are as given by the following table, with units in kilohms.

res | PMOS load resistance
0 | undefined
1 | 13.
2 | 6.3
3 | 4.2
4 | 3.1
5 | 2.5
6 | 2.1
7 | 1.8
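
The tabulated load resistances appear to follow a simple proportional rule: each entry is close to 2.5 kilohms times the process control value divided by the res value, rounded to two significant figures. The following Python sketch illustrates that reading. The function name and the closed-form rule are assumptions inferred from the tables, not a formula stated in the text.

```python
def pmos_load_kohm(res: int, process_control: int) -> float:
    """Approximate nominal PMOS load resistance in kilohms.

    Assumed rule inferred from the tables: R = 2.5 * process_control / res.
    res = 0 is undefined and process control = 0 is reserved in the text.
    """
    if not 1 <= res <= 7:
        raise ValueError("res must be in 1..7 (0 is undefined)")
    if not 1 <= process_control <= 7:
        raise ValueError("process control must be in 1..7 (0 is reserved)")
    return 2.5 * process_control / res


if __name__ == "__main__":
    # Column for process control = 5, the nominal setting.
    for res in range(1, 8):
        print(res, round(pmos_load_kohm(res, 5), 2))
```

With process control equal to 5, the computed values (12.5, 6.25, 4.17, 3.13, 2.5, 2.08, 1.79) round to the entries in the PMOS load resistance table.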



When Mnemosyne is reset, a default value of 0 is loaded into each 0 field, 3 in
each ref field, 3 in each lvl field and 1 in each res field, which is a byte value of
121. The process control field of the configuration register is set to 5, and the
swing fine tuning field is set to 1. These settings correspond to a chip with nominal
processing parameters, low power and high voltage swing operation.

For nominal operating conditions, the ref field is set to 2, the lvl field is set to 2, and
the res field is set to 5, which is a byte value of 85. The process control field is set
equal to the PMOS strength field, and the swing fine tuning field is set to 1.
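
The two byte values quoted above (121 at reset, 85 for nominal operation) can be checked by packing the fields as a reserved bit 7, a two-bit ref field, a two-bit level field and a three-bit res field. The Python sketch below does this; the function names are illustrative, and the bit positions are an assumption derived from the field widths and the quoted byte values.

```python
def pack_knob(ref: int, lvl: int, res: int) -> int:
    """Pack one power-and-swing knob byte.

    Assumed layout: bit 7 reserved (0), ref in bits 6..5,
    lvl in bits 4..3, res in bits 2..0.
    """
    if not (0 <= ref <= 3 and 0 <= lvl <= 3 and 0 <= res <= 7):
        raise ValueError("field out of range")
    return (ref << 5) | (lvl << 3) | res


def unpack_knob(byte: int) -> tuple:
    """Return (ref, lvl, res) from a knob byte."""
    return (byte >> 5) & 0x3, (byte >> 3) & 0x3, byte & 0x7


if __name__ == "__main__":
    print(pack_knob(3, 3, 1))  # reset default
    print(pack_knob(2, 2, 5))  # nominal operating conditions
```

Under this layout the reset defaults pack to 121 and the nominal settings pack to 85, matching the text.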

Configuration Register

A Configuration register is provided on the Euterpe processor to control the
fine-tuning of the Hermes channel configuration, to control the global process
parameter settings, to control the two phase-locked loop frequency generators,
and to control the temperature sensors and read temperature values.

The resistor fine tuning field of the configuration register controls the analog bias
settings for PMOS loads in Hermes channel input and output termination circuits,
in order to accommodate variations in circuit parameters due to the manufacturing
process, and to provide fine-tuning of the input and output impedance levels.
Under normal operating conditions, four times (4*) the value read from the PMOS
drive strength field should be written into the resistor fine tuning field. In order to
provide fine-tuning of the input and output impedance levels, an external
measurement of the impedance or voltage levels is required. A change of the
"^d" 'Tr'^L"£^^^^^^^^^^ -P- and output 



value | resistor fine tuning
0..13 |
14..19 | increase PMOS conductance to nominal*20/value
20 | use PMOS loads at nominal conductance
21..31 | decrease PMOS conductance to nominal*20/value



The swing fine tuning field of the configuration register controls a small offset in
the reference voltage and the swing voltage for internal logic circuits. The swing
fine tuning voltage is added to the output current field of the Hermes channel
configuration registers to select the output current. The interpretation of the field
is given by the table:



value | swing fine tuning | reference fine tuning
0 | -25 mV | -12 mV
1 | 0 | 0
2 | +25 mV | +13 mV
3 | +50 mV or +100 mV | +25 mV or +50 mV

The process control field of the configuration register controls the analog bias
settings for PMOS loads in internal logic circuits, in order to accommodate
variations in circuit parameters due to the manufacturing process. Under normal
operating conditions, the value read from the PMOS drive strength field should be
written into the process control field. The interpretation of the field is given by the
table:



value | process control
0 | Reserved
1 | increase PMOS conductance to 5.00*nominal
2 | increase PMOS conductance to 2.50*nominal
3 | increase PMOS conductance to 1.66*nominal
4 | increase PMOS conductance to 1.25*nominal
5 | use PMOS loads at nominal conductance
6 | decrease PMOS conductance to 0.83*nominal
7 | decrease PMOS conductance to 0.71*nominal



The PMOS drive strength field of the configuration register is a read-only field
that indicates the drive strength, or conductance gain, of PMOS devices on the
Euterpe chip, expressed as a digital binary value. This field is used to calibrate the
power and voltage level configuration, given variations in process characteristics
of individual devices. The interpretation of the field is given by the table:



value | PMOS drive strength
0 | Reserved
1 | 0.2*nominal
2 | 0.4*nominal
3 | 0.6*nominal
4 | 0.8*nominal
5 | nominal
6 | 1.2*nominal
7 | 1.4*nominal
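
The calibration rule described above (copy the PMOS drive strength reading into the process control field, and write four times that reading into the resistor fine tuning field) can be sketched in Python as follows. The function names are hypothetical. The scale factors 0.2 times the value and 5 divided by the value are read off the two tables, so treating them as exact formulas is an assumption.

```python
def drive_strength_factor(drive_strength: int) -> float:
    """PMOS drive strength relative to nominal (table: 0.2 * value)."""
    if not 1 <= drive_strength <= 7:
        raise ValueError("drive strength must be in 1..7 (0 is reserved)")
    return 0.2 * drive_strength


def process_control_factor(process_control: int) -> float:
    """Conductance adjustment relative to nominal (table: about 5 / value)."""
    if not 1 <= process_control <= 7:
        raise ValueError("process control must be in 1..7 (0 is reserved)")
    return 5.0 / process_control


def calibrate(drive_strength: int) -> dict:
    """Field values to write under normal operating conditions."""
    drive_strength_factor(drive_strength)  # range check only
    return {
        "process_control": drive_strength,
        "resistor_fine_tuning": 4 * drive_strength,
    }


if __name__ == "__main__":
    for d in range(1, 8):
        settings = calibrate(d)
        net = drive_strength_factor(d) * process_control_factor(settings["process_control"])
        print(d, settings, round(net, 3))
```

Under these assumptions the process control adjustment cancels the measured drive strength, so the net conductance comes out at the nominal value for every reading, and a nominal reading of 5 yields the resistor fine tuning value 20.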



There are two identical phase-locked loop (PLL) frequency generators,
designated PLL0 and PLL1. These PLLs generate internal and external clock
signals of configurable frequency, based upon an input clock reference of either
50 MHz or 500 MHz. PLL0 controls the internal operating frequency of the
Euterpe processor, while PLL1 controls the operating frequency of the Hermes
channel interfaces. The configuration fields for PLL0 and PLL1 have identical
meanings, described below:

The PLL0 divide ratio and PLL1 divide ratio fields select the divider ratio for
each PLL, where legal values are in the range 8..23. These divider ratios permit
clock signals to be generated in the range from 400 MHz to 1.15 GHz, when the
input clock reference is at 50 MHz with prescaling bypassed, or at 500 MHz with
prescaling used.

Setting the PLL0 feedback bypass bit or the PLL1 feedback bypass bit of the
configuration register causes the generated clock to bypass the PLL oscillator and to
operate off the input clock directly. Setting these bits causes the frequency
generated to be the optionally prescaled reference clock. These bits are cleared
during normal operation, and set by a reset.
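
The relationship between the input clock, the prescaler, the divide ratio and the feedback bypass can be illustrated with a short Python sketch. The function and parameter names are illustrative only, and the model simply multiplies the (optionally prescaled) reference by the divide ratio, which is what the stated 400 MHz to 1.15 GHz range implies.

```python
def pll_output_mhz(input_mhz: float, divide_ratio: int,
                   prescaler_bypass: bool, feedback_bypass: bool = False) -> float:
    """Nominal generated clock frequency in MHz for PLL0 or PLL1.

    prescaler_bypass=False divides the input clock by 10 first.
    feedback_bypass=True returns the optionally prescaled reference itself.
    """
    if not 8 <= divide_ratio <= 23:
        raise ValueError("legal divide ratios are 8..23")
    reference = input_mhz if prescaler_bypass else input_mhz / 10
    if feedback_bypass:
        return reference
    return reference * divide_ratio


if __name__ == "__main__":
    print(pll_output_mhz(50, 8, prescaler_bypass=True))     # lowest legal ratio
    print(pll_output_mhz(500, 23, prescaler_bypass=False))  # highest legal ratio
```

A 50 MHz input with the prescaler bypassed and a 500 MHz input with the prescaler in use both give a 50 MHz reference, so either arrangement spans 400 MHz (ratio 8) to 1150 MHz (ratio 23).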

The PLL0 range field and the PLL1 range field of the configuration register are
used to select an operating range for the internal PLLs. If the PLL range is set to
zero, the PLL will operate at a low frequency (below 0.xxx GHz); if the PLL range
is set to one, the PLL will operate at a high frequency (above 0.xxx GHz). At reset
this bit is cleared, as the input clock frequency is unknown.

Setting the PLL prescaler bypass bit of the configuration register causes the
phase-locked loops PLL0 and PLL1 to use the input clock directly as a reference
clock. This bit is cleared during normal operation with a 500 MHz input clock, in
which case the input clock is divided by 10, and is set during normal operation with a
50 MHz input clock. At reset this bit is cleared, as the input clock frequency is
unknown.

Setting the conversion prescaler bypass bit of the configuration register causes the
temperature conversion unit to use the input clock directly as a reference clock.
Otherwise, clearing this bit causes the input clock to be divided by 10 before use
as a reference clock. The reference clock frequency of the temperature
conversion unit is nominally 50 MHz, and in normal operation, this bit should be
set or cleared depending on the input clock frequency. At reset this bit is cleared,
as the input clock frequency is unknown.

The meltdown margin field controls the setting of the threshold at which
meltdown is signalled. This field is used to test the meltdown prevention logic. The
interpretation of the field is given by the table below, with a tolerance of ±6
degrees C, and 3 degrees C hysteresis:



value | meltdown threshold
0 | 150 degrees C
1 | 90 degrees C
2 | 50 degrees C
3 | 20 degrees C



The conversion start bit controls the initiation of the conversion of a temperature
sensor or reference to a digital value. Setting this bit causes the conversion to
begin, and the bit remains set until conversion is complete, at which time the bit is
cleared.

The conversion selection field controls which sensor or reference value is
converted to a digital value. The interpretation of the field is given by the table
below:

value | conversion selection
0 | local temperature sensor
1 | local temperature reference
2 | remote 0 temperature sensor
3 | remote 0 temperature reference
4 | remote 1 temperature sensor
5 | remote 1 temperature reference
6 | remote 2 temperature sensor
7 | remote 2 temperature reference
8 | remote 3 temperature sensor
9 | remote 3 temperature reference
10..15 | Reserved



The conversion counter field is set to the two's complement of the downslope 
count. The counter counts upward to zero, at which point the upslope ramp 
begins, and continues counting on the upslope until the conversion completes. 
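
As a small illustration of the counter encoding, the Python sketch below computes the value to write for a desired downslope count, assuming the conversion counter is the ten-bit field in bits 9..0 of the configuration register. The function names are hypothetical.

```python
COUNTER_BITS = 10                      # assumed width: bits 9..0
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_field(downslope_count: int) -> int:
    """Two's complement of the downslope count, truncated to the field width."""
    if not 0 <= downslope_count <= COUNTER_MASK:
        raise ValueError("downslope count does not fit the counter field")
    return (-downslope_count) & COUNTER_MASK


def counts_until_zero(field_value: int) -> int:
    """Number of upward counts before the counter wraps to zero."""
    return (-field_value) & COUNTER_MASK


if __name__ == "__main__":
    field = counter_field(100)
    print(field, counts_until_zero(field))
```

Writing the two's complement means the counter reaches zero after exactly the requested number of downslope counts, after which the value it holds directly measures the upslope duration.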

Hermes Channel Configuration Registers

Configuration registers are provided on the Euterpe processor to control the
timing, current levels and termination resistance for each of the twelve Hermes
high-bandwidth channels. A configuration register is dedicated to the
control of each Hermes channel, and additional information in the configuration
register at octlet 31 controls aspects of the Hermes channel circuits in common.
The Hermes channel configuration registers are Cerberus registers 32..43, where
32 corresponds to Hermes channel 0, and where 43 corresponds to Hermes
channel 11.
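
A Hermes channel configuration word can be assembled from its fields as in the following Python sketch. The bit positions follow the register table for octlets 32..43 (bit 62 quadrature bypass, bit 61 quadrature range, bit 60 output termination, bits 59..57 termination resistance, bits 56..54 output current, then nine six-bit skew fields from bit 53 down to bit 0). The function names and argument conventions are illustrative assumptions.

```python
def skew_field(digital: int, analog: int) -> int:
    """Six-bit skew field: digital skew in bits 5..3, analog skew in bits 2..0."""
    if not (0 <= digital <= 7 and 0 <= analog <= 7):
        raise ValueError("skew values must be in 0..7")
    return (digital << 3) | analog


def hermes_register_number(channel: int) -> int:
    """Cerberus register number for a Hermes channel (0..11 maps to 32..43)."""
    if not 0 <= channel <= 11:
        raise ValueError("channel must be in 0..11")
    return 32 + channel


def pack_hermes_config(quadrature_bypass: int = 0, quadrature_range: int = 0,
                       output_termination: int = 1, termination_resistance: int = 1,
                       output_current: int = 1, skews=None) -> int:
    """Pack a 64-bit channel configuration word.

    skews lists nine six-bit values ordered [clk, bit 0, ..., bit 7];
    the defaults reproduce the reset values listed in the register table.
    """
    if skews is None:
        skews = [skew_field(0, 1)] * 9
    if len(skews) != 9 or any(not 0 <= s <= 63 for s in skews):
        raise ValueError("expected nine skew values in 0..63")
    word = ((quadrature_bypass & 1) << 62) | ((quadrature_range & 1) << 61)
    word |= (output_termination & 1) << 60
    word |= (termination_resistance & 7) << 57
    word |= (output_current & 7) << 54
    for index, value in enumerate(skews):
        word |= value << (6 * index)   # clk at bits 5..0, bit 7 at bits 53..48
    return word


if __name__ == "__main__":
    print(hermes_register_number(11), hex(pack_hermes_config()))
```

Keeping the skew values in a list ordered from the clock upward makes the shift amount a simple multiple of six, which mirrors the regular layout of the skew fields in the register.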

The quadrature bypass bit controls whether the HiC clock signal is delayed by
a fraction of a HiC clock cycle to latch the Hi7..0 bits. In normal, full-speed
operation, this bit should be cleared to a zero value. If this bit is set, the
quadrature delay is defeated and the HiC clock signal is used directly to latch the
Hi7..0 bits.

The quadrature range bit is used to select an operating range for the quadrature
delay circuit. If the quadrature range is set to zero, the circuit will operate at a low
frequency (below 0.xxx GHz); if the quadrature range is set to one, the circuit will
operate at a high frequency (above 0.xxx GHz).

The output termination bit is used to select whether the output circuits are
resistively terminated. If the bit is set to zero, the output has high impedance; if
the bit is set to one, the output is terminated with a resistance equal to the input
termination. At reset, this bit is set to one, terminating the output.

The termination resistance field is used to select the impedance at which the
Hermes channel inputs, and optionally the Hermes channel outputs, are
terminated. The resistance level is controlled relative to the setting of the resistor
fine tuning field of the configuration register. The interpretation of the field is
given by the table, with units in Ohms and nominal PMOS conductance and bias
settings:

value | termination resistance
0 | Reserved
1 | 250. Ohms
2 | 125. Ohms
3 | 83.3 Ohms
4 | 62.5 Ohms
5 | 50.0 Ohms
6 | 41.7 Ohms
7 | 35.7 Ohms



The output current field is used to select the current at which the Hermes
channel outputs are operated. The interpretation of the field is given by the table,
with units in mA:

value | output current
0 |
1 | 2. mA
2 | 4. mA
3 | 6. mA
4 | 8. mA
5 | 10. mA
6 | 12. mA
7 | 14. mA



The output voltage swing is the product of the composite termination resistance,
(input termination resistance^-1 + output termination resistance^-1)^-1, and the output
current. The output voltage swing should be set at or below 700 mV, and is
normally set to the lowest value which permits a sufficiently low bit error rate,
which depends upon the noise level in the system environment.
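
A worked example of the swing calculation, written as a Python sketch. The table lookups are expressed as 250 Ohms divided by the field value and 2 mA times the field value, which reproduce the tabulated entries. Treating a disabled output terminator as leaving only the input termination in the composite resistance is an assumption, and the sketch ignores any contribution from the swing fine tuning field.

```python
def termination_ohms(value: int) -> float:
    """Termination resistance for field values 1..7 (table: 250 / value Ohms)."""
    if not 1 <= value <= 7:
        raise ValueError("termination resistance value must be in 1..7")
    return 250.0 / value


def output_current_ma(value: int) -> float:
    """Output current for field values 1..7 (table: 2 * value mA)."""
    if not 1 <= value <= 7:
        raise ValueError("output current value must be in 1..7")
    return 2.0 * value


def output_swing_mv(termination_value: int, current_value: int,
                    output_terminated: bool = True) -> float:
    """Output voltage swing in millivolts (Ohms times mA gives mV)."""
    r_in = termination_ohms(termination_value)
    if output_terminated:
        # Output terminator equals the input termination, so the two are in parallel.
        composite = 1.0 / (1.0 / r_in + 1.0 / r_in)
    else:
        composite = r_in
    return composite * output_current_ma(current_value)


if __name__ == "__main__":
    swing = output_swing_mv(5, 7)
    print(round(swing, 1), "mV", "ok" if swing <= 700 else "exceeds 700 mV limit")
```

With 50 Ohm terminations at both ends and the maximum 14 mA output current, the composite resistance is 25 Ohms and the swing is about 350 mV, comfortably below the 700 mV limit.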

The skew fields individually control the delay between the internal Hermes
channel output clock and each of the HoC and Ho7..0 high-bandwidth output
channel signals. Each skew field contains two three-bit values, named digital skew
and analog skew, as shown below:

  5            3 2            0
| digital skew | analog skew |
        3              3

The digital skew fields set the number of delay stages inserted in the output path
of the HoC and the Ho7..0 high-bandwidth output channel signals. The analog
skew fields control the power level, and thereby control the switching delay, of a
single delay stage. Setting these fields permits a fine level of control over the
relative skew between output channel signals. Nominal values for the output delay
for various values of the digital skew and analog skew fields are given below,
assuming a nominal setting for the Hermes channel knob:



digital skew | delay (ps) | plus analog skew
0 | 0 | no
1 | 320 | yes
2 | 400 | yes
3 | 470 | yes
4 | 570 | yes
5 | 670 | yes
6 | 770 | yes
7 | 870 | yes

analog skew | delay (ps)
0 | Reserved
1 | 797
2 | 777
3 | +40
4 | +20
5 | 0
6 | -10
7 | -20



When reset, a default value of 0 is loaded into the digital skew fields and 1 is
loaded into the analog skew fields, setting a minimum output delay for the HoC
and Ho7..0 signals.

Mnemosyne Memory

MicroUnity's Mnemosyne memory architecture is designed for ultra-high
bandwidth systems. The architecture integrates fast communication channels
with SRAM caches and interfaces to standard DRAM.

The Mnemosyne interfaces include byte-wide input and output channels intended
to operate at rates of at least 1 GHz. These channels provide a packet
communication link to synchronous SRAM cache on chip and a controller for
external banks of conventional DRAM components. Mnemosyne provides second-
level cache and main memory for MicroUnity's Terpsichore system architecture.
However, Mnemosyne is useful in many memory applications.

Mnemosyne's interface protocol embeds read and write operations to a single
memory space into packets containing command, address, data, and
acknowledgement. The packets include check codes that will detect single-bit
transmission errors and multiple-bit errors with high probability. As many as eight

.wZfv^! ^'T^T'-'r'^u ^^i^'^.i " expand the cache and memorv and to 
improve the bandwidth of the DRAM memory.

rl^fS?."' ^^j^' ""l as a set of small blocb. which are 

Dv^^niJiciL .^T ' '/u f°gic^ 'nemory data of a fixed word size. 

SnT^X redundancy supports the elimination of faultv 

storage "^"'"ng the use ot non-volatile or one-time-programmable 

blnU°or«l' hJa^Y^'" connection of multiple 

t~cc J ^""^ ^F"^^ components to a Mnemosyne device. Variations m 
I^^A^L Ta '""i °f '"'""^'^ P"« be accommodated bv 

lT:ll^.u''Th°u ■•,egi""s. The interface supports interleaving 

Iddressfng " ^""^ P^«= ""^^^ ^""^^ to improve latency for localized 

Terpsichore uses Mnemosyne devices as a second-level cache, main-memory
expansion, and optionally containing directory information. Each Mnemosyne
device supports up to four banks of DRAM, each 72 bits wide (64 bits +
ECC). Using standard DRAM components, Terpsichore and Mnemosyne achieve
bandwidth in excess of 9 Gbytes/sec to secondary cache and 2 Gbytes/sec to
main memory. Terpsichore may use twice or four times the number of
Mnemosyne devices to expand the cache and memory and to increase the
bandwidth of the main memory system to in excess of 8 Gbytes/sec.

Architecture Framework

The Mnemosyne architecture builds upon MicroUnity's Hermes high-bandwidth
channel architecture and upon MicroUnity's Cerberus serial bus architecture,
and complies with the requirements of Hermes and Cerberus. Mnemosyne uses
parameters A and W as defined by Hermes.

The Mnemosyne architecture defines a compatible framework for a family of
implementations with a range of capabilities. The following implementation-
defined parameters are used in the rest of the document in boldface. The value
indicated is for MicroUnity's first Mnemosyne implementation.



Parameter | Interpretation | Value | Range of legal values
C | log2 logical memory words in SRAM cache | 13 | C > 1
B | log2 physical memory words in SRAM cache physical memory block | 11 | B > 1
S | number of bits per word of an SRAM physical memory block | 9 | S > 0
t | size of tag field in cache entry, in bits | 13 | t = 2P + E - C
e | size of ECC field in cache entry, in bits | 8 | e ≥ log2(8W + t + 1 + e) + 1
n | number of physical memory blocks used to produce a logical memory word | 10 | (8W + t + 1 + e) / S
N | number of SRAM physical memory blocks, not including redundant blocks | 40 | N = n * 2^(C-B)
D | number of divisions of SRAM physical memory blocks covered by separate sets of redundant blocks | 2 | 1 < D < 16
R | number of redundant SRAM physical memory blocks in each redundancy division | 2 | 1 ≤ R < 16
P | number of DRAM row and column address interface pins | 12 | 9 < P < (A-8-E)/2
K | number of address interface pins which may be configured as row-address-only pins | 0 |
I | log2 of number of interleaved accesses in DRAM interface | 2 | 0 < I < 16
E | log2 of number of banks of DRAM expansion | 2 | 1 ≤ E < 15
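
The derived parameters in the table can be cross-checked with a short Python sketch. It assumes W = 8 (a 64-bit data word, consistent with the 64 data bits per DRAM bank mentioned in the text) and assumes that n is the quotient rounded up to a whole number of blocks; neither assumption is stated explicitly in the table. The function name is illustrative.

```python
import math


def derived_parameters(C: int = 13, B: int = 11, S: int = 9, P: int = 12,
                       E: int = 2, e: int = 8, W: int = 8) -> dict:
    """Compute t, n, N and the SRAM cache size from the independent parameters.

    W = 8 and the rounding-up of n are assumptions made for this sketch.
    """
    t = 2 * P + E - C                        # tag field size in bits
    entry_bits = 8 * W + t + 1 + e           # data + tag + 1 + ECC bits per entry
    n = math.ceil(entry_bits / S)            # physical blocks per logical word
    N = n * 2 ** (C - B)                     # physical blocks, excluding redundancy
    cache_kbit = (2 ** C) * n * S // 1024    # logical words times stored word width
    return {"t": t, "n": n, "N": N, "cache_kbit": cache_kbit}


if __name__ == "__main__":
    print(derived_parameters())
```

With the first-implementation values this gives t = 13, n = 10 and N = 40, and a cache of 8k words of 90 bits, or 720 kbit, which agrees with the block diagram annotation.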



Interfaces and Block Diagram

Mnemosyne uses two Hermes unidirectional, byte-wide, differential, packet-
oriented data channels for its main, high-bandwidth interface between a memory
control unit and Mnemosyne's memory. This interface is designed to be
cascadeable, with the output of a Mnemosyne chip connected to the input of
another, to expand the size of memory that can be reached via a single set of data
channels. An external memory control unit is in complete control of the selection
of operations within Mnemosyne and in complete control of the timing
and content of information on the high-bandwidth interfaces.

A Cerberus bit-serial interface provides access to configuration, diagnostic and
tester information, using TTL signal levels at a moderate data rate.

Mnemosyne contains additional interfaces to conventional dynamic random-
access memory devices (DRAM) using TTL signals. Each Mnemosyne device
contains output signals to independently control four banks of memory;
each bank is nominally 9 bytes wide, and connects to a single set of bidirectional
data signals (typically 16Mx4 organized, 64-Mbit
DRAM). Up to four banks of DRAM may be connected to each Mnemosyne
device, permitting up to 0.5 Gbyte of DRAM per Mnemosyne chip.
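
The 0.5 Gbyte figure can be reproduced with simple arithmetic, sketched here in Python. The sketch assumes each bank is built from 16M x 4 devices across the full 72-bit width and counts only the 64 data bits per bank toward capacity; the function and parameter names are illustrative.

```python
def dram_capacity(banks: int = 4, bank_width_bits: int = 72,
                  data_bits_per_bank: int = 64, device_width_bits: int = 4,
                  device_depth: int = 16 * 2 ** 20) -> dict:
    """Device count per bank and total data capacity in bytes (ECC excluded)."""
    devices_per_bank = bank_width_bits // device_width_bits
    data_bytes_per_bank = device_depth * data_bits_per_bank // 8
    return {
        "devices_per_bank": devices_per_bank,
        "total_data_bytes": banks * data_bytes_per_bank,
    }


if __name__ == "__main__":
    result = dram_capacity()
    print(result["devices_per_bank"], result["total_data_bytes"] / 2 ** 30, "Gbyte")
```

Under these assumptions each bank uses 18 devices and holds 128 Mbytes of data, so four banks give the 0.5 Gbyte per chip quoted above.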

TTL inr.rf i ^second voltage of 5.0 Volts (5% tolerance) is used onlv for 

IIL«e^ LTng," '''''''''' " ™- P-^-S-B TAB (tap°; 

prweTI.0;^^^^^^^^^^^ »d 466 pins for 3.3V 



count | pin | meaning
18 | HiC, Hi7..0 | high-bandwidth input
18 | HoC, Ho7..0 | high-bandwidth output
72 | DQ71..0 | DRAM data
48 | A11..0/3..0 | DRAM address
12 | RAS3..0, CAS3..0, WE3..0 | DRAM control
6 | SC, SD, SN3..0 | Cerberus interface
174 | | total signal pins
? | VDD | 3.3 V above VSS
? | VCC | 5.0 V above VSS
? | VSS | most negative supply
640 | | total pins

♦ Internal circuit documentation names this signal VDDO.

The following is a diagram of the Mnemosyne device interfaces: (Numerical values
are shown for MicroUnity's first implementation.)



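[Figure: Mnemosyne external block diagram. High-bandwidth input HiC, Hi7..0 and
output HoC, Ho7..0; supplies VCC, VDD and VSS; a 720 kbit HiSRAM block (8k x 90
HiSRAM, cache controller, DRAM controller); 48 DRAM address signals
A11..A0/3..0; DRAM control signals RAS/3..0, CAS/3..0, WE/3..0; DRAM data
DQ71..0.]

Mnemosyne external block diagram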



Absolute Maximum Ratings 


                                                          MIN    NOM    MAX    UNIT

Recommended operating conditions                          MIN    NOM    MAX    UNIT   REF
VT: Termination equivalent voltage                        4.5    5.0    5.5    V
Main supply voltage VDD                                   3.14   3.3    3.47   V      VSS
TTL supply voltage VCC                                    4.75   5.0    5.25   V      VSS
Operating free-air temperature                            0             70     C
Electrical characteristics                                MIN    TYP    MAX    UNIT   REF
VOH: H-state output voltage HoC, Ho7..0                                        V      VDD
VOL: L-state output voltage HoC, Ho7..0                                        V      VDD
VIH: H-state input voltage HiC, Hi7..0                                         V      VDD
VIL: L-state input voltage HiC, Hi7..0                                         V      VDD
IOH: H-state output current HoC, Ho7..0                                        mA
IOL: L-state output current HoC, Ho7..0                                        mA
IIH: H-state input current HiC, Hi7..0                                         mA
IIL: L-state input current HiC, Hi7..0                                         mA
CIN: Input capacitance HiC, Hi7..0                                             pF
COUT: Output capacitance HoC, Ho7..0                                           pF
VOH: H-state output voltage A11..0/3..0,
     RAS3..0, CAS3..0, WE3..0, DQ71..0                    2.4           5.5    V      VSS
VOL: L-state output voltage A11..0/3..0,
     RAS3..0, CAS3..0, WE3..0, DQ71..0                    0             0.4    V      VSS
VOL: L-state output voltage SD                            0             0.4    V      VSS
VIH: H-state input voltage DQ71..0                        2.4           5.5    V      VSS
VIL: L-state input voltage DQ71..0                        -0.5          0.8    V      VSS
VIH: H-state input voltage SD                             2.0           5.5    V      VSS
VIH: H-state input voltage SC, SN3..0                     2.0           5.5    V      VSS
VIL: L-state input voltage SC, SD, SN3..0                 -0.5          0.8    V      VSS
IOH: H-state output current A11..0/3..0,
     RAS3..0, CAS3..0, WE3..0, DQ71..0                                         µA
IOL: L-state output current A11..0/3..0,
     RAS3..0, CAS3..0, WE3..0, DQ71..0                                  16     mA
IOL: L-state output current                                             16     mA
IOZ: Off-state output current SD                          -10           10     µA
IOZ: Off-state output current DQ71..0                     -10           10     µA
IIH: H-state input current SC, SN3..0                     -10           10     µA
IIL: L-state input current SC, SN3..0                     -10           10     µA
CIN: Input capacitance SC, SN3..0                                       4.0    pF
COUT: Output or input-output capacitance SD,
     A11..0/3..0, RAS3..0, CAS3..0, WE3..0, DQ71..0                     4.0    pF
Switching characteristics                                 MIN    TYP    MAX    UNIT
tBC: HiC clock cycle time                                 1000                 ps
tBCH: HiC clock high time                                 400                  ps
tBCL: HiC clock low time                                  400                  ps
tBT: HiC clock transition time                                                 ps
tBS: set-up time, Hi7..0 valid to HiC transition          200           100    ps
tBH: hold time, HiC transition to Hi7..0 invalid          -200          -100   ps
tOS: skew between HoC and Ho7..0                          -50           50     ps
tC: SC clock cycle time                                   50                   ns
tCH: SC clock high time                                   20                   ns
tCL: SC clock low time                                    20                   ns
tT: SC clock transition time                                            5      ns
tS: set-up time, SD valid to SC rise                                           ns
tH: hold time, SC rise to SD invalid                                           ns
tOD: SC rise to SD valid                                  5                    ns


Logical and Physical Memory Structure



Mnemosyne defines two regions: a memory region, implemented by an on-device
static RAM memory cache backed by standard DRAM memory devices, and a
configuration region, implemented by on-device read-only and read/write
registers. These regions are accessed by separate interfaces: the Hermes channel
is used to access the memory region, and the Cerberus serial interface is used to
access the configuration region. These regions are kept logically separate.

The Mnemosyne logical memory region is an array of 2^^ words of size W bytes. 
Each memory access, either a read or write, references all bytes of a single block. 
All addresses are block addresses, referencing the entire block. 

[Figure: Logical memory organization. An array of words starting at word 0,
each word spanning bits 8W-1 down to 0.]

Mnemosyne's DRAM memory physically consists of one or more banks of
multiplexed-address DRAM memory devices. A DRAM bank consists of a set of
DRAM devices which have the corresponding address and control signals
connected together, providing one word of W bytes of data plus ECC information
with each DRAM access.

Mnemosyne's SRAM memory is a write-back (write-in) single-set (direct-mapped)
cache for data originally contained in the DRAM memory. All accesses to
Mnemosyne memory space maintain consistency between the contents of the
cache and the contents of the DRAM memory.
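
As a rough illustration of what single-set (direct-mapped) means here, the sketch below splits a word address into a cache index and a tag. It assumes, which the text does not state, that the low-order C address bits select the cache entry and the remaining bits form the tag. C = 13 is the cache-capacity parameter listed for the first implementation, and `split_address` is a hypothetical helper name.

```python
C = 13  # log2 of cache capacity in words (8k entries in the first implementation)


def split_address(word_addr: int, c: int = C) -> tuple[int, int]:
    """Return (tag, index) for a direct-mapped cache of 2**c entries.

    Assumption: the low c bits index the cache; the higher bits are the tag.
    """
    index = word_addr & ((1 << c) - 1)
    tag = word_addr >> c
    return tag, index


# Two addresses that differ only in their tag compete for the same entry.
a = (5 << C) | 7
b = (9 << C) | 7
print(split_address(a))  # (5, 7)
print(split_address(b))  # (9, 7)
```

Because each address maps to exactly one entry, a miss on a dirty entry implies a writeback of the old word before the new one is cached.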

The Mnemosyne configuration memory space consists of read-only and read/write
registers. Each word of the configuration memory space is eight bytes: one octlet.

Communications Channels
High-bandwidth

Mnemosyne uses the Hermes high-bandwidth channel and protocols,
implementing a slave device.

Mnemosyne contains two high-bandwidth communications channels, one
input channel and one output channel.

L^r^irr'the' hJ^' P^^« "^"cture Mnemosyne's memorv 

rre^ynV't:t^?e"^er^^^^^^^ '^'"--^^ ^^'^^ — >' 



Configuration registers provide a low-level mechanism to detect skew in
the byte-wide input channel and to adjust skew in the byte-wide output channel.
This mechanism may be employed by software to adaptively adjust for skew in the

d'eret-d:^S Zt^ ^""'"^ '° « ^Set 



Serial 



The serial port is used to configure a Mnemosyne device, set
diagnostic modes and read diagnostic information, and to enable the use of the
part within a high-speed tester.

The serial port uses the Cerberus serial bus interface.
DRAM 

The DRAM interface uses TTL levels to communicate with standard, high-
capacity dynamic RAM devices. The data path of the interface is 8W+e bits. The
DRAM components used may have a maximum size of 2^(2P) words by k bits, where
the minimum value of k is determined by capacitance limits. (Larger values of k,
up to 8W+e, meaning fewer components are required to assemble a word of
DRAMs, are always acceptable.)
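
The sizing rules above can be checked with a little arithmetic. The sketch below is only an illustration: it takes P = 12 address pins, W = 8 bytes and four banks from the first implementation, and assumes e = 8 ECC bits because the DRAM data path is 72 bits (DQ71..0). The function names are hypothetical.

```python
def dram_capacity_bytes(p_pins: int, word_bytes: int, banks: int) -> int:
    """Data capacity in bytes, excluding ECC bits."""
    words_per_bank = 2 ** (2 * p_pins)  # P row bits and P column bits, multiplexed
    return banks * words_per_bank * word_bytes


def components_per_word(word_bytes: int, ecc_bits: int, k: int) -> int:
    """Number of k-bit-wide DRAM components needed to span the 8W+e data path."""
    data_path_bits = 8 * word_bytes + ecc_bits
    return -(-data_path_bits // k)  # ceiling division


P, W, E_BITS, BANKS = 12, 8, 8, 4
print(dram_capacity_bytes(P, W, BANKS) / 2**30)  # 0.5 (Gbyte)
print(components_per_word(W, E_BITS, k=4))       # 18
```

With 16Mx4 components (k = 4), each bank therefore needs 18 devices, and four banks give the 0.5 Gbyte figure quoted earlier.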

Error Handling 

Mnemosyne performs error handling compliant with Hermes architecture.

For the current implementation, the following errors are designed to be detected
and known not detected by design:



errors detected                        errors not detected
invalid check byte                     invalid identification number
invalid command                        internal buffer overflow
invalid address                        invalid check byte on idle packet
uncorrectable error in SRAM cache
uncorrectable error in DRAM memory





An uncorrectable error in either the SRAM cache or the DRAM
memory results in the generation of an error response packet and other actions
more fully described elsewhere.

Upon receipt of the error response packet, the packet originator must read the
status register of the reporting device to determine the precise nature of the error.
Mnemosyne devices reporting an invalid packet will suppress the receipt of
additional packets until the error is cleared, by clearing the status register.
However, such devices may continue to process packets which have already been
received, and generate responses. Upon taking appropriate corrective actions and
clearing the error, the packet originator should then re-send any unacknowledged
commands.

Because of the large difference in clock rate between the high-bandwidth Hermes
channel and the Cerberus serial bus interface, it is generally safe to assume that,
after detecting an error response packet, an attempt to read the status register via
Cerberus will result in reading stable, quiescent error conditions and that the
queue of outstanding requests will have drained. After clearing the status register
via Cerberus, the packet originator may immediately resume sending requests to
the Mnemosyne device.
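
The recovery sequence described above can be sketched as follows. This is a minimal illustration, not the actual controller logic: `FakeCerberus` is a hypothetical stand-in for a Cerberus bus master, the status register is assumed to sit at octlet address 7, and `recover` and `send` are invented names.

```python
STATUS_OCTLET = 7  # assumed Cerberus address of the status register


class FakeCerberus:
    """Hypothetical stand-in for a Cerberus bus master; just a register map."""

    def __init__(self) -> None:
        self.regs = {STATUS_OCTLET: 0}

    def read(self, octlet: int) -> int:
        return self.regs.get(octlet, 0)

    def write(self, octlet: int, value: int) -> None:
        self.regs[octlet] = value


def recover(bus, unacknowledged, send):
    """Handle one error response packet; returns the status value that was read."""
    status = bus.read(STATUS_OCTLET)   # determine the precise nature of the error
    # ... corrective action appropriate to `status` would go here ...
    bus.write(STATUS_OCTLET, 0)        # zero is the only legal value to write
    for command in unacknowledged:     # re-send anything not yet acknowledged
        send(command)
    return status


bus = FakeCerberus()
bus.regs[STATUS_OCTLET] = 1 << 61      # pretend an error bit was latched
resent = []
print(hex(recover(bus, ["read A", "write B"], resent.append)))
print(resent)
```

The order matters: the status register is read before it is cleared, and commands are re-sent only after the clear, since the device suppresses receipt of new packets until then.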

Cerberus Registers

Mnemosyne's configuration registers comply with the Cerberus and Hermes
specifications. Configuration registers are internal read/only and read/write
registers which provide an implementation-independent mechanism to query and
control the configuration of a Mnemosyne device. By the use of these registers, a
user of a Mnemosyne device may tailor the use of the facilities in a general-
purpose implementation for maximum performance and utility. Conversely, a
supplier of a Mnemosyne device may modify facilities in the device without
compromising compatibility with earlier implementations.

Read/only registers supply information about the Mnemosyne implementation in a
standard, implementation-independent fashion. A Mnemosyne user may take
advantage of this information, either to verify that a compatible implementation of
Mnemosyne is installed, or to tailor the use of the part to conform to the
characteristics of the implementation. The read/only registers occupy addresses
0..5. An attempt to write these registers may cause a normal or an error response.

Read/write registers select the mapping of addresses to SRAM and DRAM banks,
control the internal SRAM and DRAM timing generators, and select power and
voltage levels for gates and signals. The read/write registers occupy addresses
6..11, 16..19, and 32.

Reserved registers in the range 12..15, 20..31, and 33..63 must appear to be
read/only registers with a zero value. An attempt to write these registers may
cause a normal or an error response.

Reserved registers in the range 64..2^16-1 may be implemented either as read/only
registers with a zero value, or as addresses which cause an error response if reads
or writes are attempted.
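
The address map described in the preceding paragraphs can be summarized in a small lookup. This is an illustrative sketch only: `classify` is a hypothetical helper, and the reserved range 33..63 is taken from the register table, which is an assumption where the scanned text is unclear.

```python
def classify(octlet: int) -> str:
    """Classify a Cerberus octlet address in the configuration region."""
    if octlet < 0:
        raise ValueError("octlet address must be non-negative")
    if octlet <= 5:
        return "read/only"
    if 6 <= octlet <= 11 or 16 <= octlet <= 19 or octlet == 32:
        return "read/write"
    if 12 <= octlet <= 15 or 20 <= octlet <= 31 or 33 <= octlet <= 63:
        return "reserved, reads as zero"
    return "reserved, zero or error response"


for address in (0, 6, 13, 32, 40, 64):
    print(address, classify(address))
```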

The format of the registers is described in the table below. The octlet is the
Cerberus address of the register; bits indicate the position of the field in a register.
The value indicated is the hard-wired value in the register for a read/only register,
and is the value to which the register is initialized upon a reset for a read/write
register. If a reset does not initialize the field to a value, or if initialization is not
required by this specification, a mark is placed in or appended to the value field. The
range is the set of legal values to which a read/write register may be set. The
interpretation is a brief description of the meaning or utility of the register field; a
more comprehensive description follows this table.



octlet  bits    field name             value                range    interpretation
0       63..16  architecture code      0x00 40 a3 49 d2 e4           Identifies memory device as compliant with MicroUnity Mnemosyne architecture.
        15..0   architecture revision  0x01 00                       Device complies with architecture version 1.0.
1       63..16  implementor code       0x00 40 a3 24 6d f3           Identifies Mnemosyne Memory device as implemented by MicroUnity.
        15..0   implementor revision   0x01 00                       Implementation version 1.0.
2       63..16  manufacturer code      0x00 40 a3 92 be 79           Identifies initial manufacturer of Mnemosyne Memory device implemented by MicroUnity.
        15..0   manufacturer revision  0x01 00                       Manufacturing version 1.0.
3       63..16  serial number          0                             This device has no serial number capability.
        15..0   dynamic address        0                             This device has no dynamic addressing capability.
4       63..60  A                      4                    0..15    size of a Mnemosyne address
        59..56  log2W                  3                    0..15    size of a Mnemosyne word
        55..48  C                      13                   0..255   log2 of cache capacity in words
        47..40  N                      40                   0..255   number of cache sub-blocks (excluding redundant blocks)
        39..36  D                      2                    0..15    Number of divisions of cache blocks covered by separate sets of redundant blocks. A zero value signifies 16 divisions.
        35..32  R                      2                    0..15    Number of redundant blocks per division. A zero value signifies 16 redundant blocks.
        31..28  P                      12                   0..15    Number of row and column address interface pins
        27..24  K                      0                    0..15    Maximum value by which column address pin count may be less than row address pin count.
        23..20  E                      2                    0..15    log2 of number of banks of DRAM expansion
        19..16  I                      2                    0..15    log2 of maximum interleaving level in DRAM interface.
        15..0                          0                             Reserved for definition in later revision of Mnemosyne architecture
5       63..0                          0                             Reserved for definition in later revision of Mnemosyne architecture
octlet  bits    field name             value   range    interpretation
6       63      reset                          0..1     set to invoke device's circuit reset
        62      clear                          0..1     set to invoke device's logic clear
        61      selftest                       0..1     set to invoke device's selftest; bits 60..48 may indicate depth of selftest
        60      tester                         0..1     set to invoke tester mode
        59      isolate/synch                  0..1     non-tester mode: if set, suppress cache misses/writebacks. tester mode: synch up
        58      source                         0..1     non-tester mode: set to 0. tester mode: source/analyze
        57      ECC disable                    0..1     disable ECC checking; can be set during normal operating mode
        56..50  0                      0       0        Reserved for additional mode bits
        49..48  module id                      0..3     Module identifier
        47      PLL bypass                     0..1     Setting this bit causes the PLL to be bypassed: the input clock signal is used directly.
        46..45  PLL range extension    0       0        Reserved for extensions to the PLL range control field.
        44      PLL range                      0..1     Set to 0 if the PLL is operating at a low frequency; 1 if the PLL is operating at a high frequency.
        43..40  output slope control           0..15    Output slope for DRAM control signals
        39..36  output slope address           0..15    Output slope for DRAM address signals
        35..32  output slope data              0..15    Output slope for DRAM data signals
        31..29  SRAM timing extension  0       0        Reserved for additional SRAM timing control bits
        28      SRAM timing                    0..1     Set to 1 to extend SRAM timing by one clock cycle.
        27..24  ECC seed extension     0       0        extend ECC seed value when W > 8
        23..16  ECC seed                       0..255   Value to modify ECC code computed on incoming data. Used to exercise ECC detection/correction logic, or to write arbitrary patterns into memory.
        15..8   cidle 0                0       0..255   Value transmitted on idle Hermes output channel when output clock zero (0).
        7..0    cidle 1                255     0..255   Value transmitted on idle Hermes output channel when output clock one (1).
octlet  bits    field name                      value   range    interpretation
7       63      reset/clear/selftest complete           0..1     This bit is set when a reset, clear or selftest operation has been completed.
        62      reset/clear/selftest status     1       0..1     This bit is set when a reset, clear or selftest operation has been completed successfully.
        61      check byte error                0       0..1     This bit is set when a received input packet has an incorrect check byte.
        60      address error                   0       0..1     This bit is set when a received input request has an address not present on the device as configured.
        59      command error                   0       0..1     This bit is set when a packet is received on the Hermes input channel with an improper command.
        58      uncorrectable ECC error         0       0..1     This bit is set when an uncorrectable error is discovered in memory.
        57      correctable ECC error           0       0..1     This bit is set when a correctable error is discovered in memory.
        56      other error                     0       0..1     This bit is set when other errors not otherwise specified occur.
        55..53  0                               0       0        Reserved
        52..48  PMOS drive                              0..15    This read/only field indicates the drive strength of PMOS devices expressed as a digital binary value.
        47..41  0                               0       0        Reserved
        40      PLL in range                            0..1     This bit indicates that the Hermes input channel clock and the PLL are at rates such that the PLL can lock.
        39..30  0                               0       0        Reserved
        29      ECC location flag               0       0..1     0 if ECC error was in cache memory. 1 if ECC error was in DRAM memory.
        28      dirty flag                      0       0..1     Dirty bit if error was in cache memory
        27..24  ECC syndrome extension          0       0        extend ECC syndrome value when e > 8
        23..16  ECC syndrome                    0       0..255   Value of syndrome encountered on previous correctable or uncorrectable ECC error.
        15..8   raw 0                           0       0..255   Value sampled on Hermes input channel when input clock is zero (0).
        7..0    raw 1                           255     0..255   Value sampled on Hermes input channel immediately following sample value in raw 0 register.
octlet  bits    field name      value   range        interpretation
8       63..32  0               0                    Reserved for handling larger address spaces.
        31..0   ECC addr        0       0..2^[?]-1   Address at which an ECC error was detected.
9       63..60  log2id          0       0..I         Number of DRAM interleaving levels can be computed as id = 2^log2id
        59..56  expand          0       1..E         Number of DRAM banks
        55..52  r               0       9..P         Number of bits in DRAM row address
        51..48  c               0       9..P         Number of bits in DRAM column address
        47..40  t1              0       0..15        Address set up time relative to RAS
        39..32  t2              0       0..15        Address hold time after RAS
        31..24  t3              0       0..15        Address set up time relative to CAS
        23..16  t4              0       0..15        CAS pulse width
        15..8   t5              0       0..15        Page mode cycle time is t3+t4+t5. Page mode CAS precharge is t3+t5
        7..0    t6              0       0..15        RAS precharge is t6+t1
10      63..56  t7              0       0..15        CAS to RAS set up for refresh cycle. t7 >= t1 to ensure RAS precharge is met.
        55..48  t8              0       0..15        Time data bus occupied from end of CAS low
        47..40  t9              0       0..15        Time output data on bus from start of t3
        39..32  t10             0       0..15        Interval between two address bus transitions
        31      refresh enable  0       0..1         If set, generate refresh cycles.
        30..24  t11                     0..127       Interval between refresh cycles.
        23..0   0               0       0            Reserved
11..15  63..0   0               0       0            Reserved
octlet  bits    field name          value   range    interpretation
16      63..56  process control     0x42    0..255   Set global power and voltage swing levels.
        55..48  IO control          0xc2    0..255   Set power and voltage swing levels in I/O circuits.
        47..40  clock dist 1        0xc2    0..255   Set power and voltage swing levels in clock distribution circuits.
        39..32  clock dist 2        0xc2    0..255   Set power and voltage swing levels in clock distribution circuits.
        31..26  0                   0       0        Reserved
        25..24  digital skew clk    0       0..3     Set number of skew delay circuits to insert in output HoC.
        23..22  digital skew bit 7  0       0..3     Set number of skew delay circuits to insert in output Ho7.
        21..20  digital skew bit 6  0       0..3     Set number of skew delay circuits to insert in output Ho6.
        19..18  digital skew bit 5  0       0..3     Set number of skew delay circuits to insert in output Ho5.
        17..16  digital skew bit 4  0       0..3     Set number of skew delay circuits to insert in output Ho4.
        15..14  digital skew bit 3  0       0..3     Set number of skew delay circuits to insert in output Ho3.
        13..12  digital skew bit 2  0       0..3     Set number of skew delay circuits to insert in output Ho2.
        11..10  digital skew bit 1  0       0..3     Set number of skew delay circuits to insert in output Ho1.
        9..8    digital skew bit 0  0       0..3     Set number of skew delay circuits to insert in output Ho0.
        7..0    analog skew clk     0xc2    0..255   Set power and voltage swing levels in HoC skew delay circuits.
octlet  bits    field name              value   range    interpretation
17      63..56  analog skew bit 7       0xc2    0..255   Set power and voltage swing levels in Ho7 skew delay circuits.
        55..48  analog skew bit 6       0xc2    0..255   Set power and voltage swing levels in Ho6 skew delay circuits.
        47..40  analog skew bit 5       0xc2    0..255   Set power and voltage swing levels in Ho5 skew delay circuits.
        39..32  analog skew bit 4       0xc2    0..255   Set power and voltage swing levels in Ho4 skew delay circuits.
        31..24  analog skew bit 3       0xc2    0..255   Set power and voltage swing levels in Ho3 skew delay circuits.
        23..16  analog skew bit 2       0xc2    0..255   Set power and voltage swing levels in Ho2 skew delay circuits.
        15..8   analog skew bit 1       0xc2    0..255   Set power and voltage swing levels in Ho1 skew delay circuits.
        7..0    analog skew bit 0       0xc2    0..255   Set power and voltage swing levels in Ho0 skew delay circuits.
18      63..56  SRAM pipe               0xc2    0..255   Set power and voltage swing levels in SRAM pipeline circuits.
        55..48  DRAM data               0xc2    0..255   Set power and voltage swing levels in DRAM data circuits.
        47..40  DRAM address            0xc2    0..255   Set power and voltage swing levels in DRAM address circuits.
        39..32  PLL in range indicator  0xc2    0..255   Set power and voltage swing levels in PLL in-range detector circuits.
        31..24  PLL phase detector      0xc2    0..255   Set power and voltage swing levels in PLL phase detector circuits.
        23..16  forward logic           0xc2    0..255   Set power and voltage swing levels in packet forwarding logic circuits.
        15..8   forward PLA             0xc2    0..255   Set power and voltage swing levels in packet forwarding PLA.
        7..0    tester logic            0xc2    0..255   Set power and voltage swing levels in tester logic circuits.
octlet     bits    field name          value   range    interpretation
19         63..56  tester PLA          0xc2    0..255   Set power and voltage swing levels in tester PLA.
           55..48  dual port RAMs      0xc2    0..255   Set power and voltage swing levels in 2-port RAM circuits.
           47..40  big PLA             0xc2    0..255   Set power and voltage swing levels in big PLAs.
           39..32  small PLA           0xc2    0..255   Set power and voltage swing levels in small PLAs.
           31..24  pipeline interface  0xc2    0..255   Set power and voltage swing levels in pipeline interface circuits.
           23..16  other logic 2       0xc2    0..255   Set power and voltage swing levels in other logic circuits.
           15..8   other logic 1       0xc2    0..255   Set power and voltage swing levels in other logic circuits.
           7..0    other logic 0       0xc2    0..255   Set power and voltage swing levels in other logic circuits.
20..31     63..0   0                   0       0        Reserved.
32         63..56  redundant 0         0       0..255   Enable and address for redundant block 0 (partition 0)
           55..48  redundant 1         0       0..255   Enable and address for redundant block 1 (partition 0)
           47..40  redundant 2         0       0..255   Enable and address for redundant block 0 (partition 1)
           39..32  redundant 3         0       0..255   Enable and address for redundant block 1 (partition 1)
           31..0   0                   0       0        Reserved for use with additional redundant blocks.
33..63     63..0   0                   0       0        Reserved for use with additional redundant blocks.
64..65536  63..0   0                   0       0        Reserved for use with later revisions of the architecture.

configuration memory space



Identification Registers

The identification registers in octlets 0..3 comply with the requirements of the
Cerberus architecture.

MicroUnity's company identifier is: 0000 0000 0000 0010 1100 0101.

MicroUnity's architecture code for Mnemosyne is specified by the following table:

Internal code name    Code number
Mnemosyne             0x00 40 a3 49 d2 e4

Mnemosyne architecture revisions are specified by the following table:

Internal code name    Code number
1.0                   0x01 00

MicroUnity's Mnemosyne implementor codes are specified by the following table:

Internal code name    Code number
MicroUnity            0x00 40 a3 24 6d f3

MicroUnity's Mnemosyne, as implemented by MicroUnity, uses implementation
codes as specified by the following table:

Internal code name    Revision number
1.0                   0x01 00

MicroUnity's Mnemosyne, as implemented by MicroUnity, uses manufacturer
codes as specified by the following table:

Internal code name    Code number
Hollers               0x00 40 a3 92 b6 79

MicroUnity's Mnemosyne, as implemented by MicroUnity, and manufactured by

Internal code name    Code number
1.0                   0x01 00

Architecture Description Registers

The architecture description registers in octlets 4 and 5 comply with the Cerberus
and Hermes specifications and contain a machine-readable version of the
architecture parameters: A, W, C, N, D, R, P, K, E, and I described in this
document.
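
To make "machine-readable" concrete, the sketch below unpacks octlet 4 into the named parameters. It is an illustration under stated assumptions: the bit positions follow the register table (with the R field read as bits 35..32), the field names E and I follow the parameter list above, and `decode_architecture` is a hypothetical helper. The sample octlet is simply the first-implementation values packed into those positions.

```python
# (name, high bit, low bit) for each field of octlet 4, as read from the register table.
FIELDS = [
    ("A", 63, 60), ("log2W", 59, 56), ("C", 55, 48), ("N", 47, 40),
    ("D", 39, 36), ("R", 35, 32), ("P", 31, 28), ("K", 27, 24),
    ("E", 23, 20), ("I", 19, 16),
]


def field(octlet: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) from a 64-bit register value."""
    return (octlet >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_architecture(octlet: int) -> dict:
    return {name: field(octlet, hi, lo) for name, hi, lo in FIELDS}


params = decode_architecture(0x430D2822C0220000)
print(params)
print("word size in bytes:", 1 << params["log2W"])   # 8
print("cache words:", 1 << params["C"])              # 8192
```

A user could compare the decoded values against what the software was built for before issuing any memory requests.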
Control Register

The control register is a 64-bit register with both read and write access. It is
altered only by Cerberus accesses: Mnemosyne does not alter the values written
to this register.

The reset bit of the control register complies with the Cerberus specification and
provides the ability to reset an individual Mnemosyne device in a system. Setting
this bit is equivalent to a power-on reset or a broadcast Cerberus reset (low level
on SD for 33 cycles) and resets configuration registers to their power-on values,
which is an operating state that consumes minimal current. At the completion of
the reset operation, the reset/clear/selftest complete bit of the status register is
set, and the reset/clear/selftest status bit of the status register is set.

The clear bit of the control register complies with the Cerberus specification and
provides the ability to clear the logic of an individual Mnemosyne device in a
system. Setting this bit causes all internal high-bandwidth logic to be reset, as is
required after reconfiguring power and swing levels. At the completion of the
reset operation, the reset/clear/selftest complete bit of the status register is set,
and the reset/clear/selftest status bit of the status register is set.

The selftest bit of the control register complies with the Cerberus specification
and provides the ability to invoke a selftest on an individual Mnemosyne device in
a system. However, Mnemosyne does not define a selftest mechanism at this time,
so setting this bit will immediately set the reset/clear/selftest complete bit and the
reset/clear/selftest status bit of the status register.

The tester bit of the control register provides the ability to use a Mnemosyne part
as a component of a high-bandwidth test system for a Mnemosyne or other part
using the high-bandwidth Hermes channel. In normal operation this bit must be
cleared. When the tester bit is set, Mnemosyne is configured as either a signal
source or signal analyzer, depending on the setting of the source bit of the control
register. Four Mnemosyne parts are cascaded to perform the signal source or
signal analyzer function. When the isolate/synch bit is set, a synchronization
pattern is transmitted on the Hermes output channel and received on the Hermes
input channel to synchronize the cascade of four Mnemosynes; the isolate/synch
bits must be turned off starting at the end of the cascade to properly terminate the
synchronization operation.

When not in tester mode, the isolate/synch bit of the control register is used to
initialize the SRAM cache and perform functional testing of the SRAM cache. This
bit must be cleared in normal operation. Setting this bit and setting the ECC
disable bit of the control register suppresses cache misses and dirty cache line
writebacks, so that the contents of the SRAM cache can be tested as if it were
simple SRAM memory. A read-allocate command returns the octlet data from the
SRAM cache entry that would be used to cache the requested location; the data is
unconditionally returned, regardless of the contents of the tag, dirty and ECC
fields of the SRAM cache entry. A read-noallocate command returns an octlet in
the following format:

 63  62..61  60..48    47..8     7..0
| d |   0   |  tag  | undefined | ecc |
  1     2      13       40         8
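
A test program reading back cache entries this way needs to pull the fields apart. The following sketch assumes the layout just shown (d in bit 63, two zero bits, a 13-bit tag in bits 60..48, 40 undefined bits, and 8 ECC bits in bits 7..0); `decode_cache_entry` is a hypothetical helper name.

```python
def decode_cache_entry(octlet: int) -> dict:
    """Split a read-noallocate result into dirty bit, tag and ECC fields."""
    return {
        "dirty": (octlet >> 63) & 0x1,
        "tag": (octlet >> 48) & 0x1FFF,  # 13 bits, 60..48
        "ecc": octlet & 0xFF,            # 8 bits, 7..0
        # bits 47..8 are undefined and deliberately ignored
    }


sample = (1 << 63) | (0x1ABC << 48) | 0x5A
print(decode_cache_entry(sample))
```

Masking out bits 47..8 keeps the comparison in a test harness from depending on values the device does not define.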

A write-allocate command writes the octlet data, along with the dirty bit set, the
tag corresponding to the requested location, and valid ECC data into the SRAM
cache entry that would be used to cache the requested location. A write-
noallocate command writes the octlet data, along with the dirty bit cleared, the tag
corresponding to the requested location, and ECC data as if the dirty bit was set,
into the SRAM cache entry that would be used to cache the requested location.

The ECC seed field of the control register can be set to alter the ECC data that
would otherwise be written to the SRAM cache entry, to write completely
arbitrary patterns into the SRAM cache.

Setting the ECC disable bit of the control register causes Mnemosyne to ignore ECC
errors in the SRAM cache and DRAM memory. This bit may be set during
normal operation of Mnemosyne.

The module id field of the control register sets the module address for
Mnemosyne. The module address defines which one of four module addresses
Mnemosyne will select to answer to read and write requests.

Setting the PLL bypass bit of the control register causes the internal clocking of
Mnemosyne to bypass the PLL, so that the input clock signal is used directly.

The PLL range field of the control register selects the operating range of
the internal PLL. A three-bit field is reserved for this function, of which one bit is
currently defined: if the PLL range is set to zero, the PLL will operate at a low
frequency (below 0.xxx GHz); if the PLL range is set to one, the PLL will operate
at a high frequency (above 0.xxx GHz).

The output slope fields of the control register set the slew rate for the TTL
outputs used for DRAM control, address and data signals, as detailed in a
following section.

Mnemosyne uses a sufficiently high-frequency clock that internal SRAM timing
can be controlled by synchronous logic, rather than asynchronous or self-timed
logic. Internal SRAM timing may be controlled by loading values into configuration
registers. The current specification reserves four bits for control of SRAM timing;
one is currently used.

The SRAM timing bit is normally cleared, providing internal SRAM cycle time of 4
clock cycles. Setting the SRAM timing bit extends the cycle time to 5 clock cycles.

The ECC seed field of the control register provides a mechanism to cause ECC
errors and thus test the ECC circuits. The field reserves 12 bits for this purpose; 8
bits are used in the current implementation. The field must be set to zero for
normal operation. The value of the field is xor'ed against the ECC value normally
computed for write operations.
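
The effect of the ECC seed can be illustrated with a minimal sketch. The function name and the sample ECC value are hypothetical; only the xor relationship and the 8-bit width of the current implementation come from the text.

```python
def ecc_written(computed_ecc, ecc_seed, ecc_bits=8):
    """Return the ECC value stored on a write: computed ECC xor'ed with the seed."""
    mask = (1 << ecc_bits) - 1          # 8 bits are used in the current implementation
    return (computed_ecc ^ ecc_seed) & mask


computed = 0xB2                          # hypothetical ECC value for some octlet
normal = ecc_written(computed, 0x00)     # seed of zero: normal operation, ECC unchanged
forced = ecc_written(computed, 0x01)     # non-zero seed flips the low ECC bit
print(hex(normal), hex(forced))
```

A seed with a single bit set would be expected to look like a single-bit error when the entry is read back, which is one way the ECC circuits could be exercised.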

The cidle 0 and cidle 1 fields of the control register provide a mechanism to
repeatedly send simple patterns on the Hermes output channel for purposes of
testing and skew adjustment. For normal operation, the cidle 0 field must be set to
zero (0), and the cidle 1 field must be set to all ones (255).

Status Register

The status register is a 64-bit register with both read and write access, though the
only legal value which may be written is a zero, to clear the register. The result of 
writing a non-zero value is not specified. 

The reset/clear/selftest complete bit of the status register complies with the
Cerberus specification and is set upon the completion of a reset, clear or selftest
operation as described above.

The reset/clear/selftest status bit of the status register complies with the Cerberus
specification and is set upon the successful completion of a reset, clear or selftest
operation as described above.

The check byte error bit of the status register is set when a received input packet
has an incorrect check byte. The packet is otherwise ignored or forwarded to the
Hermes output channel, and an error response packet is generated.

The address error bit of the status register is set when a received input request 
packet has an address which is not present on the device as currently configured. 
An error response packet is generated. 

The command error bit of the status register is set when a packet is received on 
the Hermes input channel with an improper command, such as a read, write or 
error response packet. 

The uncorrectable ECC error bit of the status register is set on the first
occurrence of an uncorrectable ECC error in either the SRAM cache or the
DRAM memory. The ECC location flag is set or cleared, indicating whether the
error was in the cache memory (cleared, 0) or the DRAM memory (set, 1). The
ECC syndrome field of the status register is loaded with the syndrome of the data
for which the error was detected. The ECC addr register is loaded with the
address of the data at which the error was detected. An error response packet is
generated. Once one uncorrectable ECC error is detected, no further correctable
or uncorrectable ECC errors are reported in the status register until this error is
cleared by writing a zero value into the status register.

The correctable ECC error bit of the status register is set on the first occurrence
of a correctable ECC error in either the SRAM cache or the DRAM memory,
provided an uncorrectable ECC error has not already been reported. The ECC
location flag is set or cleared, indicating whether the error was in the cache
memory (cleared, 0) or the DRAM memory (set, 1). The dirty flag indicates, for an
error in the cache memory, the value of the dirty bit. The ECC syndrome field of
the status register is loaded with the syndrome of the data for which the error was
detected. The ECC addr register is loaded with the address of the data at which
the error was detected. Once one correctable ECC error is detected, no further
correctable ECC errors are reported in the status register until this error is
cleared by writing a zero value into the status register. The occurrence of the
error will cause a response packet to be generated with a "stomped" check byte
pattern, but is not explicitly reported with an error response packet.

Ln,r °- 'Tu- 'T'" ^^-^^^ '"^'^ o'heru-ise .specified 

occur. There are no errors of this class reported by the current implementation.

bd1crt^S^the*'^rK./l;'"^'*'J''^'* °^ ''''''' ' ^"d/onlv field that 

indicates the drive strength, or conductance gain, of PMOS devices on the
Mnemosyne chip, expressed as a digital binary value. The field is used to calibrate
the power and voltage level configuration, given variations in process
characteristics of individual devices. The interpretation of the field is given by the
following table:

value   PMOS drive strength
0
1       0.1*nominal
2       0.2*nominal
3       0.3*nominal
4       0.4*nominal
5       0.5*nominal
6       0.6*nominal
7       0.7*nominal
8       0.8*nominal
9       0.9*nominal
10      nominal
11      1.1*nominal
12      1.2*nominal
13      1.3*nominal
14      1.4*nominal
15      1.5*nominal
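
The table reads as a linear scale in tenths of the nominal drive strength. A small sketch of that reading follows; the function name is hypothetical, and value 0 is rejected because the table gives no entry for it.

```python
def pmos_drive_strength(field_value):
    """Return the drive strength as a multiple of nominal for field values 1..15."""
    if not 1 <= field_value <= 15:
        # The table lists no drive strength for value 0; out-of-range values are rejected.
        raise ValueError("no drive strength is listed for this field value")
    return field_value / 10


for value in (1, 10, 15):
    print(value, pmos_drive_strength(value))
```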



ctnl\r)nX \ TrV. vii '^%f"'"s '"^icates that the Hemes input 

th„ Ti Pn 1 oscdlator are running at sufficiently similar rates such 

that the PLL can lock. This bit is used to verify or calibrate the settings of the PLL
range field of the control register.

The ECC location flag bit of the status register, described above, indicates the
location of an uncorrectable ECC error or a correctable ECC error. If the bit is
cleared (0), the error was in the cache memory; if the bit is set (1), the error was
in the DRAM memory.

The dirty flag bit of the status register, described above, exhibits the dirty bit read
from cache memory that results in an uncorrectable ECC error or correctable
ECC error. The value is undefined if the currently reported ECC error was read
from DRAM memory.

The ECC syndrome field of the status register, described above, exhibits the
syndrome of an uncorrectable ECC error or correctable ECC error. A 12-bit field
is reserved for this purpose; the current implementation uses eight bits of the
field. The values in this field are implementation-dependent.

ECC syndrome values representing single-bit errors for MicroUnity's first
implementation are detailed by the following table. Entries of * are not covered by 
the ECC code; syndrome values not shown in this table are uncorrectable errors 
involving two or more bits. 



syndrome for x =    7    6    5    4    3    2    1    0
syndrome x+0      128   64   32   16    8    4    2    1
data x+0          127  124  122  121  118  117  115  112
data x+8          158  157  155  152  151  148  146  145
data x+16         174  173  171  168  167  164  162  161
data x+24         191  188  186  185  182  181  179  176
data x+32         206  205  203  200  199  196  194  193
data x+40         223  220  218  217  214  213  211  208
data x+48         239  236  234  233  230  229  227  224
data x+56         254  253  251  248  247  244  242  241
addr x+0            *    *    *    *    *    *    *    *
addr x+8           62   61   59    *    *    *    *    *
addr x+16          94   93   91   88   87   84   82   81
addr x+24                                        98   97
dirty bit                                            100
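
A syndrome read from the status register can be turned back into a bit location by inverting the table. The sketch below is illustrative only: it copies a few rows rather than the whole table, the names are hypothetical, and it assumes that the "syndrome" row refers to the ECC check bits themselves.

```python
# A few rows copied from the table; each list runs from x = 7 down to x = 0.
ROWS = {
    ("ecc", 0): [128, 64, 32, 16, 8, 4, 2, 1],
    ("data", 8): [158, 157, 155, 152, 151, 148, 146, 145],
    ("data", 56): [254, 253, 251, 248, 247, 244, 242, 241],
    ("addr", 16): [94, 93, 91, 88, 87, 84, 82, 81],
}


def build_lookup():
    """Map each syndrome value to (kind, bit number)."""
    lookup = {100: ("dirty", 0)}              # the dirty bit has its own syndrome
    for (kind, base), row in ROWS.items():
        for column, syndrome in enumerate(row):
            lookup[syndrome] = (kind, base + 7 - column)
    return lookup


LOOKUP = build_lookup()


def locate_single_bit_error(syndrome):
    """Return (kind, bit) for a listed syndrome, or None if it is not in this partial table."""
    return LOOKUP.get(syndrome)


print(locate_single_bit_error(158))
print(locate_single_bit_error(100))
```

With the full table loaded, a non-zero syndrome that is still missing from the lookup would indicate an uncorrectable error involving two or more bits.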



The raw 0 and raw 1 fields of the status register contain the values obtained from
two adjacent samples of the Hermes input channel. The raw 0 field contains a
value obtained when the input clock was zero (0), and the raw 1 field contains the
value obtained on the immediately following sample, when the input clock was one (1).
Mnemosyne must ensure that reading the status register produces two adjacent
samples, regardless of the timing of the status register read operation on
Cerberus. These fields are read for purposes of testing and control of skew in the
Hermes channel.

ECC Address Register 

The ECC addr register indicates the address at which an uncorrectable ECC 
error or correctable ECC error has occurred. Bits 63..2P+E of the ECC addr
register are reserved; they read as 0. If the ECC location flag bit of the status
register is zero, the ECC addr register contains the cache address in bits C-1..0,
and the uncorrected cache tag in bits 2P+E-1..C.
DRAM Address Mapping

Mnemosyne may interleave up to 2l DRAM accesses in order to provide for 
T''Z°x^^'T'' °* ^h^DILAM memor>- system ac the maximum bandwidth of 
the DRAM data pins. At any point in time, while some memory devices are
engaged in row precharge, others may be driving or receiving data, and others
may be receiving row or column addresses. In order to maximize the utility of this
interleaving, the logical memory address bits which select the bank are the
least-significant bits.

A logical memory address determines which bank of DRAM is accessed, the row
and column of such an access, and which interleave set is accessed. The diagram
below shows the ordering of such fields in a general DRAM configuration; the bit
addresses and field sizes shown are for a four-byte logical memory address and a
two-way interleaved configuration of 1M-word DRAM devices.
[Figure: logical memory address layout with fields select, row, col and int, of widths 1, 10, 10 and 1 bits]

A request which has the same values in the select, row,
and int fields as a currently active request is queued until the completion of the
active request, at which time the second request may be handled using a page
mode access. This mechanism helps to maintain high bandwidth access even
when the requests may not be perfectly interleaved, and provides for lower
latency access in the event that the address stream is sufficiently local to take
advantage of page mode access.

Mnemosyne devices may be cascaded for additional capacity, using the ma field in
the packet formats. The memory controller must make the mapping between a
contiguous address space and each of the separate address spaces made available
by each Mnemosyne device. For maximum performance, the memory
controller should also interleave such address spaces so that references to
adjacent addresses are handled by different devices.
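
One simple way a memory controller could meet this recommendation is to take the module address from the low-order bits of the contiguous word address. This is a sketch under that assumption; the text does not prescribe a particular mapping, and the function name is hypothetical.

```python
def map_contiguous_address(address, device_count=4):
    """Split a contiguous word address into (module address, address within device)."""
    module_address = address % device_count    # adjacent addresses land on different devices
    local_address = address // device_count    # position within the selected device
    return module_address, local_address


for address in range(6):
    print(address, map_contiguous_address(address))
```

A device count that is a power of two keeps the split a matter of selecting bit fields, which is probably what a hardware controller would prefer.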
DRAM Timing Control

An internal state machine uses configurable settings to generate event timing, to 
accommodate DRAM performance variations. The timing of DRAM read cycles to
a single DRAM bank is shown below: 



[Figure: DRAM read cycles, showing signals A, RAS, CAS, WE and D]



The timing of DRAM write cycles to a single DRAM bank is shown below: 
The timing of a read cycle followed by a write cycle to a single DRAM bank is
shown below:

[Figure: DRAM read cycle followed by write cycle]



The timing of a refresh cycle to a single DRAM bank is shown below:

[Figure: DRAM refresh cycle, showing signals A, RAS, CAS, WE and D]
The time intervals shown in the figures above control the following events:

interval  units  meaning
t1        4      Row address set up time relative to RAS.
t2        4      Row address hold time after RAS.
t3        4      Column address set up time relative to CAS.
t4        4      CAS pulse width. The data bus is sampled for a read
                 cycle at the end of t4.
t5        4      Page mode cycle time is t3+t4+t5.
                 Page mode CAS precharge is t3+t5.
t6        4      RAS precharge is t6+t1.
t7        4      CAS to RAS set up for refresh cycle. t7 >= t1 to ensure
                 RAS precharge is met.
t8        4      Time data bus assumed to be occupied (by DRAM) after
                 end of CAS low (end of t4) during read cycle. During t8,
                 Mnemosyne will not drive CAS low for a read from
                 another DRAM bank, or start a write cycle to another
                 DRAM bank.
t9        4      Time data bus driven (by Mnemosyne) from column
                 address drive (start of t3) during write cycle. During t9,
                 Mnemosyne will not drive CAS low for a read from
                 another DRAM bank, or start a write cycle to another
                 DRAM bank.
t10       4      Interval between two address bus transitions. During t10,
                 Mnemosyne will not change the address bus of another
                 DRAM bank. This limits the noise generated by slewing
                 the TTL address bus signals.
t11       1024   Interval between refresh cycles.
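
Several DRAM parameters are sums of these intervals, and a configuration can be checked before it is loaded. The following is a minimal sketch: the class, its method names and the sample values are hypothetical, and the intervals are assumed to be already expressed in a common time unit.

```python
from dataclasses import dataclass


@dataclass
class DramTiming:
    t1: int  # row address set up time relative to RAS
    t2: int  # row address hold time after RAS
    t3: int  # column address set up time relative to CAS
    t4: int  # CAS pulse width
    t5: int  # remainder of the page mode cycle
    t6: int  # RAS precharge contribution
    t7: int  # CAS to RAS set up for refresh cycle

    def page_mode_cycle_time(self):
        return self.t3 + self.t4 + self.t5

    def page_mode_cas_precharge(self):
        return self.t3 + self.t5

    def ras_precharge(self):
        return self.t6 + self.t1

    def refresh_setup_ok(self):
        # The table requires t7 >= t1 so that RAS precharge is met.
        return self.t7 >= self.t1


timing = DramTiming(t1=2, t2=2, t3=1, t4=3, t5=1, t6=4, t7=2)   # sample values only
print(timing.page_mode_cycle_time(), timing.ras_precharge(), timing.refresh_setup_ok())
```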



Additional DRAM operations may be requested before the corresponding DRAM 
bank is available, and are placed in a queue until they can be processed. 
Mnemosyne will queue DRAM writes with lower priority than DRAM reads,
unless an attempt is made to read an address that is queued for a write operation.
In such a case, DRAM writes are processed until the matching address is written.
Mnemosyne may make an implementation-dependent pessimistic guess that such 
a conflict occurs, using a subset of the DRAM address to detect conflicts. The 
number of DRAM writes which are queued is implementation-dependent. 
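
The read-after-write rule can be modelled in a few lines. This is a behavioural sketch, not the device's logic: the class name, the dictionary standing in for DRAM and the choice of eight comparison bits are all assumptions. Comparing only a subset of address bits gives the pessimistic conflict guess that the text permits.

```python
from collections import deque


class WriteQueue:
    def __init__(self, memory, conflict_bits=8):
        self.memory = memory                       # dict standing in for DRAM
        self.mask = (1 << conflict_bits) - 1       # subset of address bits compared
        self.pending = deque()                     # queued (address, data) writes

    def write(self, address, data):
        self.pending.append((address, data))       # writes wait behind reads

    def _conflicts(self, address):
        wanted = address & self.mask
        return any((queued & self.mask) == wanted for queued, _ in self.pending)

    def read(self, address):
        # Process queued writes in order until no possibly matching write remains.
        while self._conflicts(address):
            queued_address, data = self.pending.popleft()
            self.memory[queued_address] = data
        return self.memory.get(address, 0)


memory = {}
queue = WriteQueue(memory)
queue.write(0x10, 1)
queue.write(0x20, 2)
queue.write(0x30, 3)
print(queue.read(0x20), len(queue.pending))
```

A read of 0x130 in this model would also drain the write to 0x30, because the two addresses agree in the compared bits even though they differ.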

Mnemosyne uses one address bus for each interleave because dynamic power and
noise are reduced by dividing the capacitance load of the DRAM address pins into
four parts and only driving one-fourth of the load at a time. A timer (t10) prevents
two address transitions from occurring too close together, to prevent power and
noise on each address bus from having an additive effect. In addition, the loading
of the already divided RAS, CAS, and WE signals is closer to the loading on the A
signal when the address bus is also divided, reducing effects of capacitance
loading on signal skew.
Power, Swing, Skew and Slew Calibration

Mnemosyne uses a set of configuration registers to control the power and voltage
levels used for internal high-bandwidth logic and SRAM memory, to control skew
in the output byte-channel, and to control slew rates in the TTL output circuits.
The details of programming these registers are described below.

Eight-bit fields separately control the power and voltage levels used in a portion of
the Mnemosyne circuitry. Each such field contains configuration data in the
following format:



 7    6      4   3         0
| ov |  lvl    |    res     |
  1      3           4

power and swing controls



The interpretation of the fields is given by the following table:

field   value   interpretation
ov      0..1    For global setting control, if set, turns off current
                sources in order to protect logic from damage during
                changes to voltage and resistor settings. This bit must
                be set prior to changing settings and cleared afterwards.
                For local setting control, if set, override these local
                settings by the global settings.
lvl     0..7    Set voltage swing level.
res     0..15   Set resistor load value.
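
Packing and unpacking such a field is straightforward once bit positions are fixed. The sketch below assumes ov occupies bit 7, lvl bits 6..4 and res bits 3..0, which matches the value ranges in the table (1, 3 and 4 bits) but is an assumption about the exact layout; the function names are hypothetical.

```python
def pack_power_field(ov, lvl, res):
    """Pack ov (0..1), lvl (0..7) and res (0..15) into one 8-bit field."""
    if not (0 <= ov <= 1 and 0 <= lvl <= 7 and 0 <= res <= 15):
        raise ValueError("field value out of range")
    return (ov << 7) | (lvl << 4) | res


def unpack_power_field(field):
    """Return (ov, lvl, res) from an 8-bit field."""
    return (field >> 7) & 0x1, (field >> 4) & 0x7, field & 0xF


field = pack_power_field(ov=1, lvl=5, res=9)
print(hex(field), unpack_power_field(field))
```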



Values and interpretations of the lvl field are given by the following table:

value   voltage swing level
0
1
2
3
4
5
6
7

Voltage swing level control field interpretation






Values and interpretations of the res field are given by the following table:

value   resistor load value
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

Resistor load value control field interpretation



When Mnemosyne is reset, a default value of 0 is loaded into each ov field s 
each cur field and xxx in each res field. 
The digital skew fields set the number of delay stages inserted in the output path
of the HoC and the Ho7..0 high-bandwidth output channel signals. Setting these
fields, as well as the corresponding analog skew fields, permits a fine level of
control over the relative skew between output channel signals. Nominal values for
the output delay for various values of the digital skew and analog skew fields are
given below:



digital skew   analog skew   output delay
0              any           0
1              A             135
               B             155
               C             175
               D             195
               E             215
2              A             220
               B             260
               C             300
               D             340
               E             380
3              A             330
               B             390
               C             450
               D             510
               E             570



When Mnemosyne is reset, a default value of 0 is loaded into the digital skew
fields, setting a minimum output delay for the HoC and Ho7..0 signals.

Need to get the right values for the analog skew setting to get these nominal values.
The output slope fields of the control register set the slew rate for the TTL
outputs used for DRAM control, address and data signals, according to the
following table:

          slew rate (V/ns) for        slew rate (V/ns) for
          control, address signals    data signals
setting   rising      falling         rising      falling
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15











SRAM Redundancy Mapping

Mnemosyne uses a configurable set of redundant physical memory blocks to
enhance the manufacturability of the cache memory. A systematic method for
determining the proper configuration is described below.
To help clarify the following description, the figure below shows the logical
arrangement of the physical memory blocks in the SRAM cache of MicroUnity's
first Mnemosyne implementation. There are 40 physical memory blocks, divided
into 4 banks of 10 blocks each. The 40 blocks are also divided into 2 partitions of
20 blocks each, and for each partition there are two redundant memory blocks
which can be configured to substitute for any of the 20 blocks in that partition.
The 40 blocks are also divided up into 5 ranks containing 8 blocks each, where
each rank contains a distinct portion of a cache line. A cache line contains eight
bytes of data, a 13-bit tag, a dirty bit, four unused bits, and an 8-bit ECC field.



[Figure: arrangement of physical memory blocks for MicroUnity's first implementation, showing ranks 0 to 4. Legend: physical memory block in bank n; redundant memory block x]



Each redundant x field, where x is in the range 0..D*R-1, controls the enabling
and mapping of one redundant block. Starting at Cerberus address
32 and bits 63..56, each successive byte controls a redundant block, covering each of the
redundant blocks in partition 0, and then, in successive bytes, blocks for additional
partitions. In other words, the redundant x field is located at Cerberus address
32+floor(x/8), bits 63-8*(x mod 8)..56-8*(x mod 8), and specifies the redundant mapping for
block (x mod D) of redundant partition floor(x/D). The format of each redundant x field






is detailed in the following figure, with bit field sizes shown for MicroUnity's first
implementation:

 7    6   5   4    2   1   0
| en |   0   |  ra    |  ba   |
  1      2       3        2

redundant block controls



The interpretation of the fields is given by the following table:

field   bits          value   interpretation
en      1             0..1    If set, use this redundant block to
                              replace a physical memory block.
0       7+Jlog2 (S)   0       Pad control field to a byte.
ra      -Jlog2(^)             Replace physical memory block at
                              rank ra with the redundant block.
ba                            Replace physical memory block at
                              bank ba with the redundant block.



hfr i? rfc f^H^^^'i by first testing the SRAM cache u-ith the isolate/svncb 
^rU *^^^^°'\"°'/«8«t" s« and aU redundant x fields set to zero, and then again 

birV;f" '/'nVr '° '^^t''' """^ '^^ '"'J^ °f testing shiuld 
i?,^? ° TU^ ^ faUures .n the primary physical memory blocks and the 

redundant blocks. Then each of the failed primary blocks is replaced with a
working redundant block by setting the redundant x fields as required.

In order to map the address and bit identities of failures to physical block failures,
the internal arrangement of bits and fields into blocks must be elaborated. First, a
Mnemosyne memory address is divided into four parts according to the following
figure, with bit field sizes shown for MicroUnity's first implementation:
 31     26 25          13 12          2 1    0
|    0    |     tag      |     ca      |  ba  |
     6          13             11         2

Mnemosyne cache address layout
The interpretation of the fields is given by the following table:

field   bits        interpretation
0       8A-2P-E     Must be zero.
tag     t           These bits are stored into the cache on a write
                    operation and compared against bits read from the
                    cache on a read operation.
ca      C-log2(n)   These bits are applied to the physical memory block
                    to select a single SRAM cache word.
ba      log2(n)     These bits are used to select one of n banks of
                    physical memory blocks.



For each cache address and cache bank, a "line" of information, containing a
cache tag, the cache data, and a dirty bit, is stored. The internal arrangement of
these fields is as shown in the following figure, with bit field sizes shown for
MicroUnity's first implementation:

 90   83 82          18  17  16  13 12       0
|  ECC  |    data      | d |   u   |   tag    |
    8         64         1     4        13

Mnemosyne cache line layout
for MicroUnity's first implementation



The interpretation of the fields is given by the following table: 



field   bits             interpretation
ECC     e                ECC bits used to correct single-bit errors and
                         detect multiple-bit errors.
data    8W               Data bits contain the visible cache data, as it
                         appears in the packets.
d       1                Dirty bit: indicates that the cache line needs to
                         be written to DRAM memory on a miss.
u       S n e-8W-1-t     Unused bits pad cache line to even number of
                         physical memory blocks.
tag     t                Tag bits identify a Mnemosyne logical address for
                         this cache line.

Mnemosyne cache line field interpretation
From the tables above, for each failure identified in the cache SRAM, a physical
memory bank number, ba, can be identified from the Mnemosyne address, and a
bit position, bi, can be identified from the Mnemosyne cache line layout. The bit
position specifies a physical memory partition number, pa, according to the
following formula:



ibi mod s*^D 
pa = J 



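[Figure: Partition from bit position, for MicroUnity's first implementation. The cache line (ECC 8 bits, data 64, d 1, u 4, tag 13) is shown divided into 9-bit segments, with boundaries marked at bit positions 90, 81/80, 72/71, 63/62, 54/53, 45, 36/35, 27/26, 18/17, 9/8 and 0.]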



The bit position also identifies a physical memory block rank, ra, according to the
following formula:



bi 



[Figure: Rank from bit position, for MicroUnity's first implementation. The cache line (data 64 bits, d 1, u 4, tag 13) is shown divided into five 18-bit segments, ranks 4 down to 0, with boundaries marked at bit positions 72/71, 54/53, 36/35, 18/17 and 0.]



So, to correct a failure in the cache SRAM, one of the working redundant blocks in
the partition pa must be configured by setting a redundant x field, where x is in
the range pa*D+D-1..pa*D, to the value:

 7    6   5   4    2   1   0
| 1  |   0   |  ra    |  ba   |

redundant block controls
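
The repair procedure can be sketched for the first implementation as follows. Several points are assumptions read off the figures rather than stated rules: bit positions are taken as 0..89, each rank is taken to span 18 bits, each rank is taken to split into two 9-bit halves with the lower half in partition 0, and the field layout is taken as en in bit 7, ra in bits 4..2 and ba in bits 1..0. All names are hypothetical.

```python
D = 2                # redundant blocks per partition in the first implementation
RANK_BITS = 18       # assumed width of one rank of the cache line
PARTITION_BITS = 9   # assumed width of one partition's share of a rank


def locate_block(bi):
    """Return (partition pa, rank ra) for cache line bit position bi."""
    if not 0 <= bi < 5 * RANK_BITS:
        raise ValueError("bit position outside the cache line")
    pa = (bi % RANK_BITS) // PARTITION_BITS
    ra = bi // RANK_BITS
    return pa, ra


def redundant_field_value(ra, ba):
    """Build the control byte that maps a redundant block onto (rank ra, bank ba)."""
    return 0x80 | (ra << 2) | ba     # en = 1, pad bits 6..5 = 0


def candidate_fields(pa):
    """Indices x of the redundant x fields that serve partition pa."""
    return list(range(pa * D, pa * D + D))


pa, ra = locate_block(40)
print(pa, ra, candidate_fields(pa), hex(redundant_field_value(ra, ba=1)))
```

Any one of the candidate fields whose redundant block tested as working could then be written with the computed value.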
Multiple Memory Chips



Up to four Mnemosyne memory devices may be cascaded to form effectively
larger memories. The cascade of memory devices will have the same bandwidth as
a single memory chip, but more latency.



cascade of four memory devices 



Packets are explicitly addressed to a particular Mnemosyne device; any packet
received on a device's input channel which specifies another module address is
automatically passed on via its output channel. This mechanism provides for the
serial interconnection of Mnemosyne devices into strings, which function
identically to a single Mnemosyne, except that a Mnemosyne string has larger
memory capacity and longer response latency.

All devices in a cascade must have the same values for A and W parameters, in
order that each part may properly interpret packet boundaries.
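
The forwarding rule can be modelled with a toy cascade. This is an illustrative sketch only: packets are represented as dictionaries with an "ma" key, and the class and function names are hypothetical rather than part of the packet format.

```python
class Device:
    def __init__(self, module_address):
        self.module_address = module_address
        self.handled = []

    def receive(self, packet):
        """Handle the packet if addressed to this device, otherwise pass it on."""
        if packet["ma"] == self.module_address:
            self.handled.append(packet)
            return None
        return packet                      # forwarded via the output channel


def send(chain, packet):
    """Send a packet down a cascade; return the number of forwards, or None if unclaimed."""
    forwards = 0
    for device in chain:
        packet = device.receive(packet)
        if packet is None:
            return forwards
        forwards += 1
    return None


chain = [Device(module_address) for module_address in range(4)]
print(send(chain, {"ma": 2, "command": "read"}))
```

The count of forwards gives a rough sense of why a device further down the string responds with longer latency.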

Response Packet Timing

Each input packet interpreted as a command causes a
response packet to be generated. The latency between the end of the request
packet and the beginning of the response packet is affected by the processing and
forwarding of other packets, by the presence or absence of the requested word in
the cache, by the setting of the SRAM and DRAM timing generators, by the
presence of queued DRAM write and read requests, as well as other
non-configurable and implementation-dependent device parameters.

Given knowledge of the cache state, configurable parameters and
implementation-dependent characteristics, a memory controller may completely
model the latency of responses. However, dependence on such characteristics is
not recommended, except for testing and characterization purposes.

SRAM accesses, DRAM accesses, and forwarded packets typically have differing
latency before a response or forwarded packet is generated at the Hermes output
channel, so that certain combinations would imply that two output packets would
need to overlap. In such a case, Mnemosyne will buffer the later output packet
until such time as it can be transmitted. However, the number of requests that
can be buffered is strictly limited to eight (the number of identification numbers)
per Mnemosyne device. It is the responsibility of the issuer of command packets to
ensure the number of outstanding packets never exceeds the limits of the buffer.
Mnemosyne may use non-fair scheduling for forwarded packets to avoid buffer
overflow conditions.
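
An issuer of command packets could enforce this limit by handing out identification numbers from a fixed pool, one pool per device. The sketch below is one such bookkeeping scheme; the class and method names are hypothetical.

```python
class RequestTracker:
    def __init__(self, id_count=8):
        self.free_ids = list(range(id_count))   # eight identification numbers per device
        self.outstanding = set()

    def issue(self):
        """Reserve an identification number for a new request."""
        if not self.free_ids:
            raise RuntimeError("all identification numbers are in use")
        request_id = self.free_ids.pop(0)
        self.outstanding.add(request_id)
        return request_id

    def complete(self, request_id):
        """Release the identification number when the response arrives."""
        self.outstanding.remove(request_id)
        self.free_ids.append(request_id)


tracker = RequestTracker()
ids = [tracker.issue() for _ in range(8)]
tracker.complete(ids[3])
print(tracker.issue(), len(tracker.outstanding))
```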

The use of DRAM page mode accesses and interleaving requires knowledge of the
relationship between a pair of transactions. Therefore, additional DRAM requests
per interleave level may be transmitted before the time at which the DRAM
controller may perform the request. These additional requests are queued and the
corresponding response packet is generated at a time controlled by the DRAM
timing generator. DRAM interleaves are serviced in an implementation-dependent
fashion to ensure starvation-free scheduling.
Persephone PCI Adaptation

MicroUnity's Persephone PCI adaptation architecture is designed to enable
Terpsichore systems to employ interface cards that conform to the PCI
(Peripheral Component Interconnect) standard.

(pur-sef'-uh-nee) In Greek mythology, Persephone (also called Kore) was the beautiful
daughter of Zeus and Demeter who represented both nature's growth cycle and death. Hades,
god of the underworld and brother of Zeus, was lonely in his underworld kingdom; therefore
Zeus, without consulting Demeter, told him to take Persephone as his wife. Thus, as Persephone
was picking flowers one day, Hades came out of the earth and carried her off to be his queen.
While the grieving Demeter, goddess of grain, searched for her daughter, the earth became a
barren wasteland. Zeus finally obtained Persephone's release, but because she had eaten a
pomegranate seed in the underworld, she was obliged to spend four months (winter) of each year
there, during which time barrenness returned to the earth.
PCI standard, version 2.0.
Calliope Interface

Portions of this section have been temporarily removed to a separate document,
"Calliope Interface Architecture," though it is still a mandatory area of the
Terpsichore System Architecture.

MicroUnity's Calliope interface architecture is designed for ultra-high bandwidth
systems. The architecture integrates fast communication channels with SRAM
buffer memory and interfaces to standard analog channels.

The Calliope interfaces include byte-wide input and output channels intended to
operate at rates of at least 1 GHz. These channels provide a packet
communication link to synchronous SRAM memory on chip and a controller for
interfaces to analog channels. Calliope provides analog interfaces for MicroUnity's
Terpsichore system architecture. However, Calliope is useful in many interface
applications.

Calliope's interface protocol embeds read and write operations to a single memory
space into packets containing command, address, data, and acknowledgement.
Packets contain check codes that will detect single-bit transmission errors and
multiple-bit errors with high probability. As many as eight operations in each
device may be in progress at a time. As many as four Calliope devices may be
cascaded to expand the buffer and analog interfaces.

Architecture Framework

The Calliope architecture builds upon MicroUnity's Hermes high-bandwidth 
channel architecture and upon MicroUnity's Cerberus serial bus architecture 
and complies with the requirements of Hermes and Cerberus. Calliope uses 
parameters A and W as defined by Hermes 
The Calliope architecture defines a compatible framework for a family of
implementations with a range of capabilities. The following implementation-defined
parameters are used in the rest of the document in boldface. The value
indicated is for MicroUnity's first Calliope implementation.



Parameter   Interpretation                          Value   Range of legal values
C           log2 logical memory words in            11      C ≥ 1
            SRAM buffer
AI          number of AI audio inputs               1       AI ≤ 3
AO          number of AO audio outputs              1       AO ≤ 3
PO, PI      number of PO phone outputs and          1       PO = PI, PO ≤ 3
            PI phone inputs
VI          number of VI video inputs               1       VI ≤ 3
VO          number of VO video outputs              1       VO ≤ 3
IR, II      number of IR infrared outputs           1       IR = II, IR ≤ 3
            and II infrared inputs
SO, SI      number of SO smartcard outputs          1       SO = SI, SO ≤ 3
            and SI smartcard inputs
EQ, CI      number of EQ equalizers and CI          2       EQ = CI, EQ ≤ 3
            cable inputs
CO          number of CO cable outputs              2       CO ≤ 3
QPSK        number of QPSK cable inputs             1       QPSK ≤ 3
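
A configuration can be checked against the legal ranges with a small routine. The sketch below rests on a particular reading of the table: every count is taken to be at most 3, C at least 1, and the paired parameters (PO/PI, IR/II, SO/SI, EQ/CI) equal to each other. These readings, and all names in the code, are assumptions.

```python
FIRST_IMPLEMENTATION = {
    "C": 11, "AI": 1, "AO": 1, "PO": 1, "PI": 1, "VI": 1, "VO": 1,
    "IR": 1, "II": 1, "SO": 1, "SI": 1, "EQ": 2, "CI": 2, "CO": 2, "QPSK": 1,
}

PAIRED = [("PO", "PI"), ("IR", "II"), ("SO", "SI"), ("EQ", "CI")]
BOUNDED = ["AI", "AO", "PO", "VI", "VO", "IR", "SO", "EQ", "CO", "QPSK"]


def check_parameters(params, limit=3):
    """Return a list of problems found; an empty list means the set looks legal."""
    problems = []
    if params["C"] < 1:
        problems.append("C must be at least 1")
    for name in BOUNDED:
        if params[name] > limit:
            problems.append(f"{name} must be at most {limit}")
    for first, second in PAIRED:
        if params[first] != params[second]:
            problems.append(f"{first} must equal {second}")
    return problems


print(check_parameters(FIRST_IMPLEMENTATION))
```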



Interfaces and Block Diagram

Calliope uses two Hermes unidirectional, byte-wide, differential, packet-oriented
data channels for its main, high-bandwidth interface between a memory control

oumm n?f r^n ' '"^"u-- ™' ^"'Sned to be cascadeable.' with the 

output of a Calliope chip connected to the input of another, to expand the
interface resources that can be reached via a single set of data channels. An
external memory control unit is in complete control of the selection and timing of

v2%f Callidpe circuits use a single power supply voltage, nommallv at 3.3 
Volts (5 /o tolerance). A second voltage of 5.0 Volts (5% tolerance) is used only for 

S;ret Bonrg)" "^""^ ^'"'P""" ^ ™- '""^''^ P^^''^^'"^ 'T^P^ 

Pin assignments are to be defined: there are 174 signal pins and 466 pins for 3.3V 
power, 5.0V power and substrate, for a total of 640 pins. 
count   pin              meaning
        HiC, Hi7..0      hi-bandwidth input
18      HoC, Ho7..0      hi-bandwidth output

6       SC, SD, SN3..0   Cerberus interface
174                      total signal pins
?       VDD              3.3 V above VSS
?       VCC              5.0 V above VSS
?       VSS              most negative supply
640                      total pins



The following is a diagram of the Calliope device interfaces: (Numerical values are
shown for MicroUnity's first implementation.)




Calliope external block diagram 



Internal circuit documentation names this signal VDDO.
                                      MIN    NOM    MAX    UNIT   REF
Absolute Maximum Ratings

Recommended operating conditions
VT: Termination equivalent voltage    4.5    5.0
Main supply voltage VDD               3.14   3.3    3.47   V      VSS
TTL supply voltage VCC                4.75   5.0    5.25   V      VSS
Operating free-air temperature        0             70     C
Electrical characteristics


MIN 


TYP 


MAX 


UNIT 


RPr 


VoR. H-srate outout voltaae HoC. Ho? p. 








V 


VDD 


Vcl: L-5tate output voltaae HcC. Ho? n 








V 


VDD 


Vih: H-state inout voltaae HiC. Hit n 








V 


VDD 


Vil: L-state inout voltaae HiC. Hi? o 








V 


VDD 


Iqh: H-state output current HoC. Ho? o 








nnA 




lOL* L-state output current HoC. Ho? n 








mA 




Iih: H-state input current HiC, Hi? o 








mA 




L-state input current HiC. Hi/ q 








mA 




Cin: Input capacitance HiC. Hi? o 








pF 




Cour OuiDut capacitance HoC. Ho? o 








pF 




Vgn: H state output vcltago-A:i4-i;n 


2^ 






V 


ycc 


V^: L ctat: output voltage A^i 


9 






V 


ycc 


Vol" L-state output voltage SD 
Vp-t: H Gtato input voltage DQ:l^-q 


0 

3t4 




0.4 
5 5 


V 
V 


VS^ 

V 00 


Vm,: L stat2 input voltage DQz^-^ 
Vih: H-state input voltage SD 


-OS 
2.0 




5.5 


V 

v 


V/cc 
VSS 


Vih: H-state input voltage SC. SN3 n 


2.0 






u 

V 


V DO 


Vil: L-state input voltage SC. SD. SN^ n 


-0.5 




0 8 


V 


V 00 


Iqh: H state output current Awuj-.^^ q-. 












Ig^: L Gtate output current /\n ng q-. 






+6 






Iql: L-state output current SD 






16 


mA 




loz: Off-State output current SD 

log: Off jtato output current DQgu~^ 


-10 




10 






Iih: H-state input current SC. SN3 0 


-10 




45 

10 


HA 




Iil: L-state input current SC. SN3 0 


-10 




10 


pA 




Cin: Input capacitance SC. SN3 0 






4.0 


pF 




Cqut: Output or input-output 
capacitance. SD.-A44--Q3^7-RAS3-^; 






4.0 


pF 
Switching characteristics

                                                    MIN    TYP    MAX    UNIT
tBC: HiC clock cycle time                                                ps
tBCH: HiC clock high time                                                ps
tBCL: HiC clock low time                            600                  ps
tBT: HiC clock transition time                                    100    ps
tBS: set-up time, Hi7..0 valid to HiC xition        200           100    ps
tBH: hold time, HiC xition to Hi7..0 invalid        -200          -100   ps
tOS: skew between HoC and Ho7..0                    -50           50     ps
tC: SC clock cycle time                             50                   ns
tCH: SC clock high time                             20                   ns
tCL: SC clock low time                              20                   ns
tT: SC clock transition time                                      5      ns
tS: set-up time, SD valid to SC rise                                     ns
tH: hold time, SC rise to SD invalid                                     ns
tOD: SC rise to SD valid                            5                    ns



Logical and Physical Memory Structure

Calliope defines two regions: a memory region, implemented by an on-device
static RAM memory along with high-bandwidth control registers, and a
configuration region, implemented by on-device read-only and read/write
registers. These regions are accessed by separate interfaces: the Hermes channel
is used to access the memory region, and the Cerberus serial interface is used to
access the configuration region. These regions are kept logically separate.

The Calliope logical memory region is an array of 2^(8A) words of size W bytes. Each
memory access, either a read or write, references all bytes of a single block. All
addresses are block addresses, referencing the entire block.
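
The size of the logical memory region follows directly from the two parameters
named above. The sketch below is illustrative only: it assumes A is the number of
address bytes and that W is supplied as log2(W); the function name and the example
values are hypothetical and are not part of the specification.

```python
def logical_memory_size(address_bytes: int, log2_word_bytes: int) -> tuple[int, int]:
    """Return (addressable words, bytes per word) for the logical memory region."""
    if address_bytes < 0 or log2_word_bytes < 0:
        raise ValueError("parameters must be non-negative")
    words = 1 << (8 * address_bytes)      # 2^(8A) block addresses
    word_bytes = 1 << log2_word_bytes     # W bytes per word
    return words, word_bytes


# Hypothetical example values: A = 4 address bytes, log2(W) = 3.
words, word_bytes = logical_memory_size(4, 3)
print(words, word_bytes)
```

Because every address is a block address, the total addressable byte count is simply
words * word_bytes; a given device may populate far less than the full range.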



[Figure: Logical memory organization. Words 0, 1, 2, ... 2^(8A)-1, each 8W bits
wide (bit positions 8W-1..0).]



Calliope's SRAM memory is a buffer for data which flows to or from interface
devices.

Calliope's configuration region consists of read-only and read/write registers. The
size of a logical block in the configuration memory space is eight bytes: one octlet.
Communications Channels

High-bandwidth

Calliope uses the Hermes high-bandwidth channel and protocols, implementing a
slave device.

Calliope contains two Hermes high-bandwidth communications channels, one
input channel and one output channel.

Calliope has no memory corresponding to the Hermes-designated cache, so the
no-allocate attribute of read and write operations has no effect.

Configuration-region registers provide a low-level mechanism to detect skew in
the byte-wide input channel and to adjust skew in the byte-wide output channel.
This mechanism may be employed by software to adaptively adjust for skew in the
channels, such as may arise in device-to-device wiring.

Serial 

A serial port is used to configure the Calliope device, set
diagnostic modes and read diagnostic information, and to enable the use of the
part within a high-speed tester.

The serial port uses the Cerberus serial bus interface. 

Error Handling 

Calliope performs error handling compliant with Hermes architecture.

For the current implementation, the following errors are designed to be detected,
or are known by design not to be detected:



errors detected            errors not detected

invalid check byte         invalid identification number
invalid command            internal buffer overflow
invalid address            invalid check byte on idle packet
                           uncorrectable error in SRAM buffer







Upon receipt of the error response packet, the packet originator must read the
status register of the reporting device to determine the precise nature of the error.
Calliope devices reporting an invalid packet will suppress the receipt of additional
packets until the error is cleared, by clearing the status register. However, such
devices may continue to process packets which have already been received, and
generate responses. Upon taking appropriate corrective actions and clearing the
error, the packet originator should then re-send any unacknowledged commands.
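
The recovery sequence described above (read status, clear the error, re-send) can
be sketched as follows. This is only an outline under stated assumptions: the three
callables stand in for whatever Cerberus and Hermes access routines a system
provides, and their names and signatures are hypothetical.

```python
from typing import Callable, Iterable


def recover_from_error(read_status: Callable[[], int],
                       clear_status: Callable[[], None],
                       resend: Callable[[object], None],
                       unacknowledged: Iterable[object]) -> int:
    """Follow the originator's recovery steps after an error response packet."""
    status = read_status()           # 1. read the status register (via Cerberus)
    # Corrective action based on the status bits would be taken here.
    clear_status()                   # 2. clear the error by clearing the status register
    for command in list(unacknowledged):
        resend(command)              # 3. re-send commands that were never acknowledged
    return status


# Usage with stand-in callables that only record what happened.
log = []
seen = recover_from_error(lambda: 0x40,
                          lambda: log.append("clear"),
                          lambda c: log.append(("send", c)),
                          ["a", "b"])
```

The status value is returned so the caller can decide which corrective action, if
any, was appropriate before the commands were re-sent.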

Because of the large difference in clock rate between the high-bandwidth Hermes
channel and the Cerberus serial bus interface, it is generally safe to assume that,
after detecting an error response packet, an attempt to read the status register via
Cerberus will result in reading stable, quiescent error conditions and that the

^l^eamope di.^r " '"™^^'«^>- ^^nding requests to 

Cerberus Registers

Calliope's configuration registers comply with the Cerberus and Hermes
specifications. Cerberus registers are internal read/only and read/write registers
which provide an implementation-independent mechanism to query and control
the configuration of devices in a Terpsichore system. By the use of these registers,
a user of a Terpsichore system may tailor the use of the facilities in a general-
purpose implementation for maximum performance and utility. Conversely, a
supplier of a Terpsichore system component may modify facilities in the device
without compromising compatibility with earlier implementations. These registers
are accessed via the Cerberus serial bus.

As a component of a Terpsichore system, each Calliope interface contains
a set of Cerberus configuration registers. Additional sets of
configuration registers are present in the other devices of a Terpsichore system,
including Euterpe processor devices, and Mnemosyne memory devices.

Read/only registers supply information about the Terpsichore system
implementation in a standard implementation-independent fashion. Terpsichore
software may take advantage of this information, either to verify that a compatible
implementation of Terpsichore is installed, or to tailor the use of the part to conform to
the characteristics of the implementation.

The read/only registers occupy addresses 0..5. An attempt to write these registers 
may cause a normal or an error response. 

Read/write registers select operating modes and select power and voltage levels
for gates and signals. The read/write registers occupy addresses 6..7, 10..14 and
25..32.

Reserved registers in the range 8..9, 15..24 and 33..63 must appear to be read/only
registers with a zero value. An attempt to write these registers may cause a normal
or an error response.

Reserved registers in the range 64..2^16-1 may be implemented either as read/only
registers with a zero value, or as addresses which cause an error response if reads
or writes are attempted.
The format of the registers is described in the table below. The octlet is the
Cerberus address of the register; bits indicate the position of the field in a register.
The value indicated is the hard-wired value in the register for a read/only register,
and is the value to which the register is initialized upon a reset for a read/write
register. If a reset does not initialize the field to a value, or if initialization is not
required by this specification, a * is placed in or appended to the value field. The
range is the set of legal values to which a read/write register may be set. The
interpretation is a brief description of the meaning or utility of the register field; a
more comprehensive description follows this table.



octlet  bits    field name             value                range   interpretation

0       63..16  architecture code      0x00 40 a3 92 b4 49          Identifies interface device as compliant
                                                                    with MicroUnity Calliope architecture.
        15..0   architecture revision  0x01 00                      Device complies with architecture
                                                                    version 1.0.

1       63..16  implementor code       0x00 40 a3 49 db 3c          Identifies Calliope interface device as
                                                                    implemented by MicroUnity.
        15..0   implementor revision   0x01 00                      Implementation version 1.0.

2       63..16  manufacturer code      0x00 40 a3 a4 5d ff          Identifies initial manufacturer of Calliope
                                                                    interface device implemented by MicroUnity
                                                                    as MicroUnity.
        15..0   manufacturer revision  0x01 00                      Manufacturing version 1.0.

3       63..16  serial number          0                            This device has no serial number capability.
        15..0   dynamic address        0                            This device has no dynamic addressing
                                                                    capability.

4       63..60  A                      4                    0..15   size of a Hermes address
        59..56  log2W                  3                    0..15   size of a Hermes word
        55..48  C                      11                   0..255  log2 of buffer capacity in words
        47..0   0                      0                    0       Reserved for definition in later revision
                                                                    of Calliope architecture

5       63..20  0                      0                    0       Reserved for definition in later revision
                                                                    of Calliope architecture
        19..18  AI                     1                    0..3    number of AI audio inputs
        17..16  AO                     1                    0..3    number of AO audio outputs
        15..14  PO, PI                 1                    0..3    number of PO phone outputs and PI
                                                                    phone inputs
        13..12  VI                     1                    0..3    number of VI video inputs
        11..10  VO                     1                    0..3    number of VO video outputs
        9..8    IR, II                 1                    0..3    number of IR infrared outputs and II
                                                                    infrared inputs
        7..6    SO, SI                 1                    0..3    number of SO smartcard outputs and
                                                                    SI smartcard inputs
        5..4    EQ, CI                 2                    0..3    number of EQ equalizers and CI
                                                                    cable inputs
        3..2    CO                     2                    0..3    number of CO cable outputs
        1..0    QPSK                   1                    0..3    number of QPSK cable inputs
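
Reading the architecture description register comes down to shifting and masking
an octlet. The sketch below is a minimal illustration, assuming bit 63 is the most
significant bit of the octlet and that the C field occupies bits 55..48 (the scanned
table is damaged at that position); the helper names are hypothetical.

```python
def field(octlet: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of a 64-bit octlet."""
    width = hi - lo + 1
    return (octlet >> lo) & ((1 << width) - 1)


def decode_octlet4(value: int) -> dict:
    """Decode the Hermes interface parameters held in octlet 4."""
    return {
        "A": field(value, 63, 60),       # size of a Hermes address
        "log2W": field(value, 59, 56),   # log2 of the Hermes word size
        "C": field(value, 55, 48),       # log2 of buffer capacity in words
    }


# Octlet 4 built from the tabulated values A = 4, log2W = 3, C = 11.
OCTLET4 = (4 << 60) | (3 << 56) | (11 << 48)
params = decode_octlet4(OCTLET4)

# Buffer capacity in bytes = 2^C words times 2^log2W bytes per word.
buffer_bytes = (1 << params["C"]) * (1 << params["log2W"])
```

The same field helper applies to the two-bit interface counts in octlet 5, for
example field(value, 19, 18) for the number of AI audio inputs.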
octlet  bits    field name        value  range   interpretation

6       63      reset             1      0..1    set to invoke device's circuit reset
        62      clear             1      0..1    set to invoke device's logic clear
        61      selftest          0      0..1    set to invoke device's selftest: bits
                                                 60..48 may indicate depth of selftest
        60      defer writes      0*     0..1    set to cause writes to octlets 25..43
                                                 to be deferred until the next logic-
                                                 clear or non-deferred write.
        59..50  0                 0      0       Reserved
        49..48  module id         0      0..3    Module identifier.
        47..33  0                 0      0       Reserved
        32      Hermes channel    1      0..1    Set to cause Hermes input channel
                disable                          to be ignored and idles to be
                                                 generated on output channel.
        31..16  0                 0      0       Reserved
        15..8   cidle 0           0*     0..255  Value transmitted on idle Hermes
                                                 output channel when output clock
                                                 is zero (0).
        7..0    cidle 1           255*   0..255  Value transmitted on idle Hermes
                                                 output channel when output clock
                                                 is one (1).





Case 2:05-cv-00505-TJW Document 149 Filed 10/15/2007 Page 19 of 40 

WO 97/07450 PCT/US96/13047



octlet  bits    field name          value  range   interpretation

7       63      reset/clear/        1      0..1    This bit is set when a reset, clear or
                selftest complete                  selftest operation has been
                                                   completed.
        62      reset/clear/        1      0..1    This bit is set when a reset, clear or
                selftest status                    selftest operation has been
                                                   completed successfully.
        61      meltdown            0      0..1    This bit is set when the meltdown
                detected                           detector has caused a reset.
        60      low voltage or      0      0..1    This bit is set when the voltage or
                temperature                        temperature is too low for proper
                                                   operation of logic circuits.
        59..57  0                   0      0       Reserved for indicating additional
                                                   causes of reset.
        56      Cerberus            0      0..1    This bit is set when a Cerberus
                transaction error                  transaction error has caused a
                                                   machine check.
        55      Hermes check        0      0..1    This bit is set when a Hermes
                byte error                         channel check byte error has caused
                                                   a machine check.
        54      Hermes command      0      0..1    This bit is set when a Hermes
                error                              channel command error has caused
                                                   a machine check.
        53      Hermes address      0      0..1    This bit is set when a Hermes
                error                              address error has caused a machine
                                                   check.
        52..16  0                   0*     0       Reserved
        15..8   raw 0                      0..255  Value sampled on specified Hermes
                                                   channel when input clock is zero (0)
        7..0    raw 1                      0..255  Value sampled on specified Hermes
                                                   channel immediately following
                                                   sample value in raw 0 register.



octlet  bits    field name  value  range  interpretation

8..9    63..0   0           0      0      Reserved
octlet  bits    field name              value  range   interpretation

10      63..56  0                       0      0       Reserved
        55..48  PLL anob                               PLL analog-knob settings
        47..40  0                       0              Reserved
        39..32  CI2 test                       0..7    CI2 test control
        31..24  CI1 test                0      0..7    CI1 test control
        23..16  CI2adc anob             224            CI2 ADC analog-knob settings
        15..12  CI2Q filter             3      0..7    CI2 Q filter adjust
        11..8   CI2I filter             3      0..7    CI2 I filter adjust
        7..4    0                       0              Reserved
        3       CI2 VCO                 0      0..1    CI2 external VCO switch
        2       CI2 LNA                 0      0..1    CI2 input LNA enable
        1       CI2Q ADC preamplifier   0              CI2 Q ADC preamplifier disable
        0       CI2I ADC preamplifier   0      0..1    CI2 I ADC preamplifier disable

11      63..56  CI1syn anob             224            CI1 synthesizer analog-knob settings
        55      CO2 invert              0      0..1    CO2 inversion control
        54      CO1 invert              0      0..1    CO1 inversion control
        53      CI2a invert             0      0..1    CI2a inversion control
        52      CI2b invert             0      0..1    CI2b inversion control
        51      CI1a invert             0      0..1    CI1a inversion control
        50      CI1b invert             0      0..1    CI1b inversion control
        49..48  0                       0      0       Reserved
        47..40  CI1adc anob             224            CI1 ADC analog-knob settings
        39..36  CI1Q filter             3      0..7    CI1 Q filter adjust
        35..32  CI1I filter             3      0..7    CI1 I filter adjust
        31..28  0                       0      0       Reserved
        27      CI1 VCO                 0      0..1    CI1 external VCO switch
        26      CI1 LNA                 0      0..1    CI1 input LNA enable
        25      CI1Q ADC preamplifier   0      0..1    CI1 Q ADC preamplifier disable
        24      CI1I ADC preamplifier   0      0..1    CI1 I ADC preamplifier disable
        23..16  CI2syn anob             224    0..255  CI2 synthesizer analog-knob settings
        15..8   refclk anob             224    0..255  reference clock divider analog-knob
                                                       settings
        7..0    CLIO anob               224    0..255  CLIO analog-knob settings
bits    field name              value  range     interpretation

63      capacitor calibration   0      0..1      Set to enable capacitor calibration.
62..56  calibration result      0      0..127    Result of capacitor calibration.
55..34  0                       0      0         Reserved
33      VI invert               0      0..1      VI inversion control
32      VO invert               0      0..1      VO inversion control
31..24  VI anob                 224    0..255    VI analog-knob settings
23..16  VO anob                 224    0..255    VO analog-knob settings
15..8   CO1 anob                224    0..255    CO1 analog-knob settings
7..0    CO2 anob                224    0..255    CO2 analog-knob settings

bits    field name              value  range     interpretation

63      0                       0      0         Reserved
62..56  CO2 configuration       0      0..127    CO2 configuration control
55      AI invert               0      0..1      AI inversion control
54      PI invert               0      0..1      PI inversion control
53      PO invert               0      0..1      PO inversion control
52      AO invert               0      0..1      AO inversion control
51..50  AIR bias                2      0..3      AI right amplifier bias level
49..48  AIL bias                2      0..3      AI left amplifier bias level
47..40  AIR anob                224    0..255    AI right analog-knob settings
39..32  AIL anob                224    0..255    AI left analog-knob settings
31..26  0                       0      0         Reserved
25..24  PI bias                 2      0..3      PI amplifier bias level
23..16  PI anob                 224    0..255    PI analog-knob settings
15..13  0                       0      0         Reserved
12      mute                    1      0..1      AO and PO mute
11..8   PO filter               7      0..15     PO antialias filter adjust
7..4    AOR filter              7      0..15     AO right antialias filter adjust
3..0    AOL filter              7      0..15     AO left antialias filter adjust

bits    field name              value  range     interpretation

63..56  0                       0      0         Reserved
55..48  EQ2 test                0                EQ2 test control
47..40  EQ1 test                0                EQ1 test control
39      0                       0      0         Reserved
38..32  CO1 configuration       0                CO1 configuration control
31..16  left priority                  0..65535  left priority
15..0   right priority                 0..65535  right priority

bits    field name              value  range     interpretation

63..0   0                       0      0         Reserved for expansion of Cerberus
                                                 registers upward or knobcity registers
                                                 downward.


bits    field name  value  range   interpretation

63..56              224            geographical digital knob settings
55..48              224    0..127  geographical digital knob settings
47..40              224            geographical digital knob settings
39..32              224    0..127  geographical digital knob settings
31..24              224    0..127  geographical digital knob settings
23..16              224    0..127  geographical digital knob settings
15..8               224    0..127  geographical digital knob settings
7..0                224    0..127  geographical digital knob settings

bits    field name  value  range   interpretation

63..56              224    0..127  geographical digital knob settings
55..48              224    0..127  geographical digital knob settings
47..40              224    0..127  geographical digital knob settings
39..32              224    0..127  geographical digital knob settings
31..24              224    0..127  geographical digital knob settings
23..16              224    0..127  geographical digital knob settings
15..8               224    0..127  geographical digital knob settings
7..0                224    0..127  geographical digital knob settings
bits    field name           value  range   interpretation

63..56                       224    0..127  geographical digital knob settings
55..48                       224    0..127  geographical digital knob settings
47..40                       224    0..127  geographical digital knob settings
39..32                       224    0..127  geographical digital knob settings
31..24                       224    0..127  geographical digital knob settings
23..16                       224    0..127  geographical digital knob settings
15..8                        224    0..127  geographical digital knob settings
7..0                         224    0..127  geographical digital knob settings

bits    field name           value  range   interpretation

63..56                       224    0..127  geographical digital knob settings
55..48                       224    0..127  geographical digital knob settings
47..40                       224    0..127  geographical digital knob settings
39..32                       224    0..127  geographical digital knob settings
31..24                       224    0..127  geographical digital knob settings
23..16                       224    0..127  geographical digital knob settings
15..8                        224    0..127  geographical digital knob settings
7..0                         224    0..127  geographical digital knob settings

bits    field name           value  range   interpretation

63..56                       224    0..127  geographical digital knob settings
55..48                       224    0..127  geographical digital knob settings
47..40                       224    0..127  geographical digital knob settings
39..32                       224    0..127  geographical digital knob settings
31..24                       224    0..127  geographical digital knob settings
23..16                       224    0..127  geographical digital knob settings
15..8                        224    0..127  geographical digital knob settings
7..0                         224    0..127  geographical digital knob settings

bits    field name           value  range   interpretation

63..56  Hermes channel knob  5      1..127  knob settings for Hermes channel
                                            circuits.
55..48                       224    0..127  geographical digital knob settings
47..40                       224    0..127  geographical digital knob settings
39..32                       224    0..127  geographical digital knob settings
31..24                       224            geographical digital knob settings
23..16                       224    0..127  geographical digital knob settings
15..8                        224    0..127  geographical digital knob settings
7..0                         224    0..127  geographical digital knob settings
bits    field name           value  range  interpretation

63..62  Hermes skew swing    0      0..3   Voltage swing selection for Hermes
                                           channel skew circuits
61..60  0                    0      0      Reserved
59..57  resg                 5      0..7   Global resistor mask for all knobs
56..53  0                    0      0      Reserved
52..48  termination          20     0..31  Set based on value read from PMOS
        fine-tuning                        drive strength, used to fine-tune
                                           resistor values in Hermes
                                           termination.
47..45  0                    0      0      Reserved
44..40  process control      20     0..31  Set based on value read from PMOS
                                           drive strength, used to fine-tune
                                           resistor values in knob settings.
39..37  0                    0      0      Reserved
36..32  PMOS drive strength  *      0..31  This read/only field indicates the
                                           drive strength of PMOS devices
                                           expressed as a digital binary value.
31..28  swing 3              15     0..15  Voltage swing knob setting 3
27..24  swing 2              15     0..15  Voltage swing knob setting 2
23..20  swing 1              15     0..15  Voltage swing knob setting 1
19..16  swing 0              15     0..15  Voltage swing knob setting 0
15..12  reference 3          15     0..15  Voltage reference knob setting 3
11..8   reference 2          15     0..15  Voltage reference knob setting 2
7..4    reference 1          15     0..15  Voltage reference knob setting 1
3..0    reference 0          15     0..15  Voltage reference knob setting 0
octlet  bits    field name              value  range    interpretation

31      63..58  0                       0      0        Reserved
        57      PLL prescaler bypass    0      0..1     Set to invoke PLL0 and PLL1
                                                        prescaler bypass, otherwise divide
        56      conversion prescaler    0      0..1     Set to invoke temperature conversion
                bypass                                  prescaler bypass, otherwise divide
                                                        input clock by 20.
        55..51  PLL2 divide ratio              1..31    PLL2 divider ratio
        50      PLL2 feedback bypass    1*     0..1     Set to invoke PLL2 feedback bypass.
        49      PLL2 range              0*     0..1     Set for operation at high frequency
                                                        (above 0.xxx GHz); cleared for
                                                        operation at low frequency (below
                                                        0.yyy GHz).
        48      PLL2 oscillator select  0      0..1     Set to select multivibrator oscillator;
                                                        cleared to select ring oscillator.
        47..43  PLL1 divide ratio       12     3..13    PLL1 divider ratio
        42      PLL1 feedback bypass    1*     0..1     Set to invoke PLL1 feedback bypass.
        41      PLL1 range              0*     0..1     Set for operation at high frequency
                                                        (above 0.xxx GHz); cleared for
                                                        operation at low frequency (below
                                                        0.yyy GHz).
        40      PLL1 oscillator select  0      0..1     Set to select multivibrator oscillator;
                                                        cleared to select ring oscillator.
        39..35  PLL0 divide ratio       12     6..13    PLL0 divider ratio
        34      PLL0 feedback bypass    1      0..1     Set to invoke PLL0 feedback bypass.
        33      PLL0 range              0      0..1     Set for operation at high frequency
                                                        (above 0.xxx GHz); cleared for
                                                        operation at low frequency (below
                                                        0.yyy GHz).
        32      PLL0 oscillator select  0      0..1     Set to select multivibrator oscillator;
                                                        cleared to select ring oscillator.
        31..24  analog measurement      0      0..255   Set to measure analog levels at
                                                        various test points within device.
        23..22  meltdown threshold      0      0..3     Set to perform margin testing of the
                                                        meltdown detector.
        21      conversion start        0*     0..1     Setting this bit causes the
                                                        conversion to begin. The bit remains
                                                        set until conversion is complete.
        20      0                       0      0        Reserved (selection extension)
        19..16  conversion selection    0*     0..9     Field selects which of ten
                                                        measurements are taken.
        15..10  0                       0      0        Reserved (counter extension)
        9..0    conversion counter      0*     0..1023  This field is set to the two's
                                                        complement of the downslope count.
                                                        The counter counts upward to zero,
                                                        and then continues counting on the
                                                        upslope until conversion completes.

32      63      0                       0      0        Reserved
        62      quadrature bypass       0*     0..1     Setting this bit causes the
                                                        quadrature circuit to be bypassed;
                                                        the input clock signal is used
                                                        directly.
        61      quadrature range               0..1     Set to 0 if the Hermes channel is
                                                        operating at a low frequency; 1 if the
                                                        Hermes channel is operating at a
                                                        high frequency.
        60      output termination      1      0..1     Set to enable output terminators.
                                                        Cleared to disable output
                                                        terminators.
        59..57  termination resistance  1      0..7     Set termination resistance level.
        56..54  output current          1      0..7     Set output current level.
        53..48  skew bit 7              1      0..63    Set delay in Ho7 skew circuit.
        47..42  skew bit 6              1      0..63    Set delay in Ho6 skew circuit.
        41..36  skew bit 5              1      0..63    Set delay in Ho5 skew circuit.
        35..30  skew bit 4              1      0..63    Set delay in Ho4 skew circuit.
        29..24  skew bit 3              1      0..63    Set delay in Ho3 skew circuit.
        23..18  skew bit 2              1      0..63    Set delay in Ho2 skew circuit.
        17..12  skew bit 1              1      0..63    Set delay in Ho1 skew circuit.
        11..6   skew bit 0              1      0..63    Set delay in Ho0 skew circuit.
        5..0    skew clk                1      0..63    Set delay in HoC skew circuit.

33..63  63..0   0                       0      0        Reserved for use with additional
                                                        Hermes channel interfaces
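
The conversion counter description (preload with the two's complement of the
downslope count, then count upward through zero) can be illustrated with a little
arithmetic. This is a sketch only, assuming the counter is the 10-bit field in bits
9..0; the function names are hypothetical.

```python
COUNTER_BITS = 10                      # assumed width of the conversion counter field
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def preload_for_downslope(count: int) -> int:
    """Two's complement of the downslope count, truncated to the field width."""
    if not 0 <= count <= COUNTER_MASK:
        raise ValueError("downslope count does not fit in the counter field")
    return (-count) & COUNTER_MASK


def ticks_until_zero(preload: int) -> int:
    """Number of upward counts before the counter wraps around to zero."""
    return (-preload) & COUNTER_MASK


preload = preload_for_downslope(100)   # 1024 - 100
```

After the counter reaches zero it keeps counting during the upslope, so the value
read at the end of the conversion is the upslope count itself.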
octlet     bits    field name  value  range  interpretation

64..65535  63..0   0           0      0      Reserved for use with later revisions
                                             of the architecture

configuration memory space



MicroUnity's company identifier is: 0000 0000 0000 0010 1100 0101.

Internal code name    Code number
Calliope              0x00 40 a3 92 b4 49

Calliope architecture revisions are specified by the following table:

Internal code name    Code number
                      0x01 00

MicroUnity's Calliope implementor codes are specified by the following table:

Internal code name    Code number
MicroUnity            0x00 40 a3 49 db 3c

Internal code name    Revision number
1.0                   0x01 00

Internal code name    Code number
MicroUnity            0x00 40 a3 a4 6d ff

MicroUnity's Calliope, as implemented by MicroUnity and manufactured by
MicroUnity, uses manufacturer revisions as specified by the following table:

Internal code name    Code number
1.0                   0x01 00
The architecture description registers in octlets 4 and 5 comply with the Cerberus
specification and contain a machine-readable version of the architecture
parameters A, W, C, AI, AO, PO, PI, VI, VO, IR, II, SO, SI, EQ, CI, CO, and
QPSK described in this document.

The architecture parameters describe characteristics of the Hermes interface,
the size of the Calliope buffer memory, and the number of audio, phone, video

Control Register

The control register is a 64-bit register with both read and write access.
It is altered only by Cerberus accesses: Calliope does not alter the values written
to this register.

The reset bit of the control register complies with the Cerberus specification and
provides the ability to reset an individual Calliope device in a system. Writing a
one (1) to this bit is equivalent to a power-on reset or a broadcast Cerberus reset
(low level on SD for 33 cycles) and resets configuration registers to their power-on
values, which is an operating state that consumes nominal current (as determined
by external pins), and also causes all internal high-bandwidth logic to be reset.
The duration of the reset is sufficient for the operating state changes to have taken
effect. At the completion of the reset operation, the reset/clear/selftest complete
bit of the status register is set, the reset/clear/selftest status bit of the status
register is set, and the reset bit of the control register is set.

The clear bit of the control register complies with the Cerberus specification and
provides the ability to clear the logic of an individual Calliope device in a system.
Writing a one (1) to this bit causes all internal high-bandwidth logic to be reset, as
is required after reconfiguring power and swing levels. The duration of the reset is
sufficient for any operating state changes to have taken effect. At the completion
of the reset operation, the reset/clear/selftest complete bit of the status register is
set, the reset/clear/selftest status bit of the status register is set, and the clear bit
of the control register is set.

The selftest bit of the control register complies with the Cerberus specification
and provides the ability to invoke a selftest on an individual Terpsichore device in
a system. However, Calliope does not define a selftest mechanism at this time, so
setting this bit will immediately set the reset/clear/selftest complete bit and the
reset/clear/selftest status bit of the status register.

The defer writes bit of the control register provides a mechanism to adjust several
octlets of Cerberus registers at one time with a single transition, such as when
setting individual power levels within Calliope. Writing a one (1) to this bit causes
writes to octlets 10 through 32 to have no effect (to be deferred) until the next
logic-clear or a non-deferred write. When writes have been deferred, the values
written are lost if a read of these octlets precedes the subsequent logic-clear or
non-deferred write. A non-deferred write occurs when writing to octlets
10 through 32 while the defer writes bit is cleared (0).

The module id field of the control register controls the value of the module
identifier field of the Hermes input channel which selects this Calliope device.

The Hermes channel disable bit of the control register provides the means to
begin operations on the Hermes channels after a reset, clear, or error. Writing a
one (1) to this bit causes the Hermes input channel to be ignored and forces idles
to be generated on the Hermes output channel. Writing a zero (0) to this bit
causes the Hermes input channel phase adjustment to be reset, and after a
suitable delay the Hermes channels are available for use.
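
The control register fields described above can be packed into one octlet before it
is written over Cerberus. The sketch below is illustrative, assuming the bit
positions from the register table (reset 63, clear 62, selftest 61, defer writes 60,
module id 49..48, Hermes channel disable 32, cidle 0 in 15..8, cidle 1 in 7..0);
the function name and argument names are hypothetical.

```python
def control_word(*, reset: int = 0, clear: int = 0, selftest: int = 0,
                 defer_writes: int = 0, module_id: int = 0,
                 hermes_disable: int = 0, cidle0: int = 0, cidle1: int = 255) -> int:
    """Pack control register fields into a 64-bit octlet (reserved bits stay zero)."""
    if not 0 <= module_id <= 3:
        raise ValueError("module id is a 2-bit field")
    if not (0 <= cidle0 <= 255 and 0 <= cidle1 <= 255):
        raise ValueError("cidle fields are 8-bit values")
    word = 0
    word |= (reset & 1) << 63
    word |= (clear & 1) << 62
    word |= (selftest & 1) << 61
    word |= (defer_writes & 1) << 60
    word |= module_id << 48
    word |= (hermes_disable & 1) << 32
    word |= cidle0 << 8
    word |= cidle1
    return word


# Channel held disabled (idles forced), then enabled for module 2.
held = control_word(module_id=2, hermes_disable=1)
running = control_word(module_id=2)
```

The defaults for cidle 0 and cidle 1 are the normal-operation values, zero and all
ones, so only the fields being changed need to be named.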

The cidle 0 and cidle 1 fields of the control register provide a mechanism to
repeatedly send simple patterns on the Hermes output channel for purposes of
testing and skew adjustment. For normal operation, the cidle 0 field must be set to
zero (0), and the cidle 1 field must be set to all ones (255).

The status register is a 64-bit register with both read and write access, though the
only legal value which may be written is a zero, to clear the register. The result of
writing a non-zero value is not specified.

The reset/clear/selftest complete bit of the status register complies with the
Cerberus specification and is set upon the completion of a reset, clear or selftest
operation as described above.

The reset/clear/selftest status bit of the status register complies with the Cerberus
specification and is set upon the successful completion of a reset, clear or selftest
operation as described above.

The meltdown detected bit of the status register is set when the meltdown
detector has discovered an on-chip temperature above the threshold set by the
meltdown threshold field of the configuration registers, which causes a
reset to occur and the power level to be forced to minimum (1).

Seuir^rhZ'T " of the status register is set when internal 

circuits have detected either insufficient voltage or temperature for proper
operation of high speed logic circuits, which causes a logic clear until the condition
is no longer detected (due to an increase in supply voltage or device temperature).

The Cerberus transaction error bit of the status register is set when a Cerberus
transaction error (bus timeout, invalid transaction code, invalid address) has
occurred. Note that Cerberus aborts, including locally detected parity errors,
should cause bus retries, not a machine check.

The Hermes check byte error bit of the status register is set when a Hermes
check byte error has occurred.

The Hermes command error bit of the status register is set when a Hermes
command error has occurred.

The Hermes address error bit of the status register is set when a Hermes address 
error has occurred. 

The raw 0 and raw 1 fields of the status register contain the values obtained from
two adjacent samples of the specified Hermes input channel. The raw 0 field
contains a value obtained when the input clock was zero (0), and the raw 1 field
contains the value obtained on the immediately following sample, when the input
clock was one (1). Calliope ensures that reading the status register produces two
adjacent samples regardless of the timing of the status register read operation on
Cerberus. These fields are read for purposes of testing and control of skew in the
Hermes channel interfaces.

Power and Swing Calibration Registers

Calliope uses a set of calibration registers to control the power and voltage levels 
used for internal high-bandwidth logic and memory. The details of programming 
these registers are described below. 

Eight-bit fields separately control the power and voltage levels used in a portion of
the Calliope circuitry. Each such field used to control digital circuitry (labeled
"knob") contains configuration data in the following format:

  7     5 4   3 2   1  0
 |  res  | ref | lvl | 0 |
     3      2     2    1

power and swing controls



Each such field used to control analog circuitry (labeled "anob") contains
configuration data in the following format:

[Bit-field diagram: res and lvl subfields with reserved (0) bits; caption: power and swing controls]



The range of valid values and the interpretation of the fields is given by the
following table:

field   value   interpretation
0       0       Reserved
ref     0..3    Set reference voltage level.
lvl     0..3    Set voltage swing level.
res     0..7    Set resistor load value.

Power and swing control field interpretation

The reference voltage level, voltage swing level and resistor load value are model
figures for a full-swing, lowest-power logic gate output. The actual voltage levels
and resistor load values used in various circuits are geometrically related to the
values in the tables below. Designed typical, full-speed settings for the ref, lvl and
res fields are ref=250 millivolts, lvl=500 millivolts, and res=2.5 kilohms.

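The ref field, together with the reference n fields of the configuration register,
controls the reference voltage level used for logic circuits in the specified knob
domain. The value of the ref field is interpreted by the following table:

ref   reference voltage level
0     reference 0
1     reference 1
2     reference 2
3     reference 3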



The lvl field, together with the swing n fields of the configuration register, controls
the voltage swing level used for logic circuits in the specified knob domain. The
value of the lvl field is interpreted by the following table:

lvl   voltage swing level
0     swing 0
1     swing 1
2     swing 2
3     swing 3



Values and interpretations of the swing n and reference n fields are given by the
following table, with units in millivolts:

value   reference   swing
0       138         275
1       150         300
2       163         325
3       175         350
4       188         375
5       200         400
6       213         425
7       225         450
8       238         475
9       250         500
10      263         525
11      275         550
12      288         575
13      300         600
14      325         650
15      350         700

The res field, together with the resg field of the configuration register and the
meltdown detected bit of the status register, controls the PMOS load resistance
through the resl value. For each res field, the resl value is computed as:

resl = res & (meltdown detected ? 1 : resg)
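
The knob layout and the resl rule above can be sketched in Python. This is an illustrative sketch only: the function names are hypothetical, the bit positions (res in bits 7..5, ref in bits 4..3, lvl in bits 2..1, reserved bit 0) are inferred from the reset byte value of 224, and the three-bit width of resg is assumed from its 0..7 range.

```python
def decode_knob(byte):
    """Split an 8-bit knob value into subfields (assumed bit layout)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("knob value must fit in eight bits")
    return {
        "res": (byte >> 5) & 0x7,   # resistor load value, bits 7..5
        "ref": (byte >> 3) & 0x3,   # reference voltage select, bits 4..3
        "lvl": (byte >> 1) & 0x3,   # voltage swing select, bits 2..1
        "reserved": byte & 0x1,     # reserved bit, written as zero
    }


def effective_resl(res, resg, meltdown_detected):
    """resl = res & (meltdown detected ? 1 : resg), as given in the text."""
    return res & (1 if meltdown_detected else resg)
```

Under these assumptions, the reset knob value 224 decodes to res = 7, and with resg at 5 the effective resl is 5; once meltdown is detected the same res collapses to 1.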

The resl value, together with the process control field of the configuration register,
controls the PMOS load resistance value used for logic circuits in the specified
knob domain. Values and interpretations of the resl value are given by the following
table for several process control settings, with units in kilohms:

        process control
resl    4      8      12     16     20     24     28
0       undefined
1       2.5    5.0    7.5    10.    13.    15.    18.
2       1.3    2.5    3.8    5.0    6.3    7.5    8.8
3       .83    1.7    2.5    3.3    4.2    5      5.8
4       .63    1.3    1.9    2.5    3.1    3.8    4.4
5       .50    1.0    1.5    2.0    2.5    3      3.5
6       .42    .83    1.3    1.7    2.1    2.5    2.9
7       .36    .71    1.1    1.4    1.8    2.1    2.5



interpretation 

When the process control field of the configuration register is set equal to the
PMOS drive strength field of the configuration register, nominal PMOS load
resistance values are as given by the following table, with units in kilohms:

res   PMOS load resistance
0     undefined
1     13.
2     6.3
3     4.2
4     3.1
5     2.5
6     2.1
7     1.8
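
The tabulated resistances appear to follow a simple proportionality, roughly 0.625 kilohm times the process control setting divided by resl. That relation is inferred from the table entries and is not stated in the source, and the function name below is hypothetical. A short Python sketch of the inferred relation:

```python
def pmos_load_kohm(resl, process_control=20):
    """Approximate nominal PMOS load resistance in kilohms.

    Uses the relation 0.625 * process_control / resl, which is inferred
    from the tables in the text rather than given by them.
    """
    if resl == 0:
        raise ValueError("resl = 0 is listed as undefined")
    return 0.625 * process_control / resl
```

With the default process control value of 20, res values 1 through 7 give 12.5, 6.25, 4.17, 3.13, 2.5, 2.08 and 1.79 kilohms, which round to the two-digit figures printed in the nominal table.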



When Mnemosyne is reset, a default value of 0 is loaded into each 0 field, 0 in
each ref field, 0 in each lvl field and 7 in each res field, which is a byte value of
224. The process control field of the configuration register is set to 20, and the
reference n and swing n fields are set to 15. These settings correspond to a chip
with nominal processing parameters, nominal power and high voltage swing
operation.

For nominal operating conditions, the ref field is set to 0, the lvl field is set to 0, and
the res field is set to 5, which is byte value of 5. The process control field is set
equal to the PMOS strength field, and the reference n and swing n fields are set to



Interface Configuration Registers

Interface configuration registers are provided on the Calliope interface to control
the [insert summary list of controls].

The CI1 test and CI2 test fields of interface configuration register 10 control
operating modes of the CI1 and CI2 cable input blocks.

Separate fields control the operating modes of the cable input blocks.
Each such field contains configuration data in the following format:

[Bit-field diagram: reserved (0), rotate, round, test and DSP enable subfields; caption: cable input test controls]

The range of valid values and the interpretation of the fields is given by the
following table:

field        value   interpretation
0            0       Reserved
rotate       0..1    Set to enable rotator
round        0..1    Set to enable multiplier rounding
test         0..1    Set to bypass ADC and connect cable input to cable
                     output (digital loop back)
DSP enable   0..1    Set to enable DSP output (clear to enable testing
                     of RAM)



The CI1Q filter, CI1I filter, CI2Q filter and CI2I filter fields of interface
configuration registers 10 and 11 control the cutoff frequency of the cable input
antialias filters.

Four-bit fields separately control the cutoff frequency of each cable input antialias
filter. Each such field contains configuration data in the following format:

  3   2      0
 | 0 | cutoff |
   1     3

cable input antialias filter controls

The range of valid values and the interpretation of the fields is given by the
following table:

field    value   interpretation
0        0       Reserved
cutoff   0..7    Cutoff frequency selection for antialias filter

Cable input antialias filter control field interpretation

Values and interpretations of the cutoff fields are given by the following table,
with units in megahertz, for nominal 3 dB frequency at specified junction
temperature:

cutoff   25 C    75 C    125 C
0        14.1    13.8    13.4
1        11.9    11.7    11.4
2        10.4    10.2    10.0
3        9.2     9.1     8.9
4        8.3     8.2     8.0
5        7.6     7.5     7.3
6        7.0     6.9     6.7
7        6.4     6.4     6.2

For normal operation a value of 3 is placed in the cutoff fields, selecting a 9 MHz
cutoff frequency.
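
As an illustration of how software might choose a cutoff code, the sketch below copies the 25 C column of the table above and picks the code whose nominal 3 dB frequency is closest to a requested value. The function name and the nearest-match policy are assumptions made for this example and are not part of the source.

```python
# Nominal 3 dB frequencies in MHz at 25 C, indexed by cutoff code (from the table).
CUTOFF_MHZ_25C = [14.1, 11.9, 10.4, 9.2, 8.3, 7.6, 7.0, 6.4]


def nearest_cutoff_code(target_mhz):
    """Return the cutoff code whose 25 C frequency is closest to target_mhz."""
    return min(
        range(len(CUTOFF_MHZ_25C)),
        key=lambda code: abs(CUTOFF_MHZ_25C[code] - target_mhz),
    )
```

A request for 9 MHz selects code 3, which matches the normal-operation setting given in the text.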

The CI1 VCO and CI2 VCO bits of interface configuration registers 10 and 11
control the selection of the VCO used as an input to the tuner of the cable input.
Writing a zero (0) to the bit selects the internal VCO, while writing a one (1)
selects an external VCO input. In normal operation a zero is placed in the VCO
bit, selecting the internal VCO.

The CI1 LNA and CI2 LNA bits of interface configuration registers 10 and 11
enable the LNA (low noise amplifier) used as an input to the tuner of the cable
input. Writing a zero (0) to the bit disables the LNA, while writing a one (1)
enables the LNA. In normal operation a one is placed in the LNA bit, enabling the
LNA.

The CI1Q ADC preamp, CI1I ADC preamp, CI2Q ADC preamp and CI2I ADC
preamp bits of interface configuration registers 10 and 11 enable the ADC
preamplifier output used as an input to the ADC of the cable input. Writing a zero
(0) to the bit enables the ADC preamplifier output, while writing a one (1) disables
it, allowing the tuner input to be driven from an external pin. In normal operation
a zero is placed in the ADC preamp bits, enabling the preamplifiers.

The CI1a, CI1b, CI2a, CI2b, CO1, CO2, VI, VO, AI, AO, PI, PO invert bits of
interface configuration registers 11, 12, and 13 provide for the selective inversion
of the relative clock phase of the analog-to-digital section internal interfaces in the
respective interfaces. In normal operation, a zero is placed in the invert bits,
matching the relative phases of the interface sections.

The CO1 configuration and CO2 configuration fields of interface configuration
registers 1^ and 14 provide for the configuration of external devices which assist in
the implementation of the cable output. The configuration fields drive LVTTL
outputs which can control external filters and other components. In normal
operation, a zero is placed in the configuration fields.

The PI bias, AIR bias and AIL bias fields of interface configuration register 13
control the bias current of the phone and audio input operational amplifiers.

Four-bit fields separately control the bias current of each input operational
amplifier. Each such field contains configuration data in the following format:

[Bit-field diagram: bias subfield in bits 1..0; caption: audio input operational amplifier controls]

The range of valid values and the interpretation of the fields is given by the
following table:

field   value   interpretation
bias    0..3    bias current selection for input operational amplifier



Values and interpretations of the bias fields are given by the following table, with
units in microamperes, for nominal bias current at specified junction temperature:

bias   25 C   75 C   125 C
0             200
1             133
2             100
3             80





The mute bit of interface configuration register 13 provides for initial muting of
the audio and phone outputs during initial system operation. Writing a zero (0) to
the bit enables the audio and phone outputs, while writing a one (1) forces the AO
and PO outputs to a constant value (zero with AC coupling).

The PO filter, AOR filter and AOL filter fields of interface configuration register
13 control the cutoff frequency of the phone and audio output right and left
antialias filters.

Four-bit fields separately control the cutoff frequency of each output antialias
filter. Each such field contains configuration data in the following format:

[Bit-field diagram: cutoff subfield; caption: cable input antialias filter controls]

The range of valid values and the interpretation of the fields is given by the
following table:

field    value   interpretation
cutoff   0..15   Cutoff frequency selection for antialias filter



Values and interpretations of the cutoff fields are given by the following table,
for nominal 3 dB frequency at specified junction temperature:

cutoff   25 C   75 C   125 C
0               69.9
1               65.9
2               62.3
3               58.8
4               55.8
5               53.2
6               50.9
7               48.6
8               46.4
9               44.6
10              43.9
11              41.4
12              40.0
13              38.5
14              37.2
15              35.9





The EQ1 test and EQ2 test fields of interface configuration register 14 control
operating modes of the EQ1 and EQ2 cable input equalizers.

Separate fields control the operating modes of the cable input
equalizers. Each such field contains configuration data in the following format:

[Bit-field diagram: reserved (0), round and DSP enable subfields; caption: cable input equalizer test controls]

The range of valid values and the interpretation of the fields is given by the
following table:

field        value   interpretation
0            0       Reserved
round        0..1    Set to enable multiplier rounding
DSP enable   0..1    Set to enable DSP output (clear to enable testing
                     of RAM)



Configuration Register

A configuration register is provided on the Calliope interface to control the fine-
tuning of the Hermes channel configuration, to control the global process
parameters, to control the phase-locked loop frequency generators,
and to control the temperature sensors and read temperature values.

The Hermes skew swing field of the configuration register controls the voltage
swing used in the Hermes channel skew circuits. The field should always be set
equal to the value of the lvl subfield of the Hermes channel knob field.

The resg field of the configuration register provides global control of the load
resistor in all of Calliope's high-speed logic circuits. The resg field is initially
loaded from external pins to a nominal power level (5), and can be changed
to a value in the range 0..7 to lower or raise the power and speed of the high speed
logic circuits in the Calliope device, or can be set to all ones (7) to enable control of
the load resistance by the individual res fields. By altering the value on the
external pins, Calliope can be configured for low-power (0 or 1) testing in a
restricted packaging environment.

The termination fine-tuning field of the configuration register controls the analog
bias settings for PMOS loads in the Hermes channel termination circuits, in order to
accommodate variations in circuit parameters due to the manufacturing process
and to fine-tune termination resistance levels. Under normal operating
conditions, the value read from the PMOS drive strength field should be written
into the termination fine-tuning field. The interpretation of the field is given by the
following table:

value   termination fine-tuning
0       Reserved
1-19    increase PMOS conductance to 20/value*nominal
20      use PMOS loads at nominal conductance
21-31   decrease PMOS conductance to 20/value*nominal


The process control field of the configuration register controls the analog bias
settings for PMOS loads in internal logic circuits, in order to accommodate
variations in circuit parameters due to the manufacturing process. Under normal
operating conditions, the value read from the PMOS drive strength field should be
written into the process control field. The interpretation of the field is given by the
following table:

value   process control
0
1-19    increase PMOS conductance to 20/value*nominal
20      use PMOS loads at nominal conductance
21-31   decrease PMOS conductance to 20/value*nominal

The PMOS drive strength field of the configuration register is a read-only field
that indicates the PMOS conductance relative to nominal, as given by the
following table:

value   PMOS drive strength
0       Reserved
1-19    value/20*nominal conductance
20      nominal conductance
21-31   value/20*nominal conductance



Tcrp "ho" pfocessV^' PLU '1""^' "'""""S .M<«"cy of .he 

generated'- i^fhe ' al°i "^^^"-i- 3Mt"l r/rcK^^^l ^'^"^^ ^ 
reference is ac 54 MI^. wi.h presiai^fb^^a sle!l^^ 

Setting the PLL0 feedback bypass bit or the PLL1 feedback bypass bit of the
configuration register bypasses the operation of the corresponding PLL,
causing the frequency
generated to be the optionally prescaled reference clock. These bits are cleared
during normal operation, and set by a reset.

The PLL0 range field and the PLL1 range field of the configuration register are
used to select an operating range for the internal PLLs. If the PLL range is set to
zero, the PLL will operate at a low frequency (below 0.xxx GHz); if the PLL range
is set to one, the PLL will operate at a high frequency (above 0.xxx GHz). At reset
this bit is cleared, as the input clock frequency is unknown.

Setting the PLL prescaler bypass bit of the configuration register causes the
phase-locked loops to use the input clock directly as a reference
clock. This bit is cleared during normal operation with a 1.08 GHz input clock, in
which the input clock is divided by 20, and is set during normal operation with a

Setting the conversion prescaler bypass bit of the configuration register causes the
temperature conversion unit to use the input clock directly as a reference clock.
Otherwise, clearing this bit causes the input clock to be divided by 20 before use
as a reference clock. The reference clock frequency of the temperature
conversion unit is nominally 54 MHz, and in normal operation, this bit should be
set or cleared depending on the input clock frequency. At reset this bit is cleared,
as the input clock frequency is unknown.
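
The effect of a prescaler bypass bit on the reference clock can be sketched as below. The function name is hypothetical; the divide-by-20 ratio and the 1.08 GHz and 54 MHz figures come from the text, and treating both prescaler bypass bits as behaving identically is an assumption of this sketch.

```python
PRESCALE_DIVISOR = 20  # the input clock is divided by 20 when not bypassed


def reference_clock_hz(input_clock_hz, prescaler_bypass):
    """Reference clock seen by the PLL or temperature conversion unit."""
    if prescaler_bypass:
        return input_clock_hz                  # input clock used directly
    return input_clock_hz / PRESCALE_DIVISOR   # prescaled reference
```

With the bypass bit cleared, a 1.08 GHz input clock yields the nominal 54 MHz reference.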

The meltdown margin field controls the setting of the threshold at which
meltdown is signalled. This field is used to test the meltdown prevention logic. The
interpretation of the field is given by the table below, with a tolerance of ±6
degrees C, and 5 degrees C hysteresis:

value   meltdown threshold
0       150 degrees C
1       90 degrees C
2       50 degrees C
3       20 degrees C



The conversion start bit controls the initiation of the conversion of a temperature 
sensor or reference to a digital value. Setting this bit causes the conversion to 
begin, and the bit remains set until conversion is complete, at which time the bit is 
cleared. 

The conversion selection field controls which sensor or reference value is
converted to a digital value. The interpretation of the field is given by the table
below:

value   conversion selected
0       local temperature sensor
1       local temperature reference
2..15   Reserved



The conversion counter field is set to the two's complement of the downslope 
count. The counter counts upward to zero, at which point the upslope ramp 
begins, and continues counting on the upslope until the conversion completes. 
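
The counter preload described above can be sketched as follows. The 16-bit counter width and both function names are assumptions made for illustration; the source does not state the size of the conversion counter field.

```python
COUNTER_BITS = 16  # assumed width; the source does not give the field size


def initial_counter(downslope_count, bits=COUNTER_BITS):
    """Two's complement of the downslope count as an unsigned field value."""
    if not 0 < downslope_count < (1 << bits):
        raise ValueError("downslope count does not fit the counter")
    return (-downslope_count) & ((1 << bits) - 1)


def ticks_until_upslope(counter_value, bits=COUNTER_BITS):
    """Number of upward counts before the counter reaches zero."""
    return (-counter_value) & ((1 << bits) - 1)
```

Preloading the two's complement means the counter reaches zero after exactly the downslope count, so the value it holds when the conversion completes measures the upslope duration directly.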

Hermes Channel Configuration Registers

Configuration registers are provided on the Calliope interface to control the
timing, current levels, and termination resistance for the Hermes high-
bandwidth channel. A configuration register at octlet 31 is dedicated to the control
of the Hermes channel, and additional information in the configuration register at
octlet 31 controls aspects of the Hermes channel circuits in common. The Hermes
channel configuration registers are Cerberus registers 32, where 32 corresponds
to Hermes channel 0.

The quadrature bypass bit controls whether the HiC clock signal is delayed by
approximately 1/4 of a HiC clock cycle to latch the Hi7..0 bits. In normal, full speed
operation, this bit should be cleared to a zero value. If this bit is set, the
HiC clock signal is used without the quadrature delay to latch the Hi7..0 bits.

The quadrature range bit is used to select an operating range for the quadrature
delay circuit. If the quadrature range is set to zero, the circuit will operate at a low
frequency (below 0.xxx GHz); if the quadrature range is set to one, the circuit will
operate at a high frequency (above 0.xxx GHz).

The output termination bit is used to select whether the output circuits are
terminated. If the bit is set to zero, the output has high impedance; if
the bit is set to one, the output is terminated with a resistance equal to the input
termination. At reset, this bit is set to one, terminating the output.

The termination resistance field is used to select the impedance at which the
Hermes channel inputs, and optionally the Hermes channel outputs, are
terminated. The resistance level is controlled relative to the setting of the
termination fine-tuning field of the configuration register. The interpretation of
the field is given by the following table:

value   termination resistance
0
1       250. Ohms
2       125. Ohms
3       83.3 Ohms
4       62.5 Ohms
5       50.0 Ohms
6       41.7 Ohms
7       35.7 Ohms



The output current field is used to select the current at which the Hermes
channel outputs are operated. The interpretation of the field is given by the
following table, with units in mA:

value   output current
0
1       2. mA
2       4. mA
3       6. mA
4       8. mA
5       10. mA
6       12. mA
7       14. mA



The output voltage swing is the product of the composite termination resistance,
(input termination resistance^-1 + output termination resistance^-1)^-1, and the output
current. The output voltage swing should be set at or below 700 mV and is
normally set to the lowest value which permits a sufficiently low bit error rate,
which depends upon the noise level in the system environment.
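
The swing calculation can be sketched in Python using the two tables above. The function name is hypothetical, and the sketch assumes that an enabled output termination equals the input termination (as the output termination bit is described) and that a disabled output termination contributes no conductance.

```python
# Values copied from the termination resistance and output current tables.
TERMINATION_OHMS = {1: 250.0, 2: 125.0, 3: 83.3, 4: 62.5, 5: 50.0, 6: 41.7, 7: 35.7}
OUTPUT_CURRENT_MA = {1: 2.0, 2: 4.0, 3: 6.0, 4: 8.0, 5: 10.0, 6: 12.0, 7: 14.0}
MAX_SWING_MV = 700.0  # upper limit on output voltage swing given in the text


def output_swing_mv(term_code, current_code, output_terminated=True):
    """Output voltage swing in millivolts (ohms times milliamperes)."""
    r_in = TERMINATION_OHMS[term_code]
    # Two equal resistances in parallel halve the composite resistance;
    # a high-impedance output leaves only the input termination.
    composite = r_in / 2 if output_terminated else r_in
    return composite * OUTPUT_CURRENT_MA[current_code]
```

For a 50 ohm termination at both ends and the largest current setting, output_swing_mv(5, 7) gives 350 mV; with the output unterminated the same settings reach the 700 mV limit.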

The skew fields individually control the delay between the internal Hermes
channel output clock and each of the HoC and Ho7..0 high-bandwidth output
channel signals. Each skew field contains two three-bit values, named digital skew
and analog skew, as shown below:

  5            3 2           0
 | digital skew | analog skew |
        3              3



The digital skew field sets the number of delay stages inserted in the output path
of the HoC and the Ho7..0 high-bandwidth output channel signals. The analog
skew field controls the power level, and thereby the switching delay, of
a single delay stage. Setting these fields permits a fine level of control over the
relative skew between output channel signals. Nominal values for the output delay
for various values of the digital skew and analog skew fields are given below,
assuming a nominal setting:



digital skew   delay (ps)   plus analog skew
0              0            no
1              320          yes
2              400          yes
3              470          yes
4              570          yes
5              670          yes
6              770          yes
7              870          yes

analog skew   delay (ps)
0             Reserved
1             ???
2             ???
3             +40
4             +20
5             0
6             -10
7             -20



When Calliope is reset, a default value of 0 is loaded into the digital skew and 1 is
loaded into the analog skew fields, setting a minimum output delay for the HoC
and Ho7..0 signals.

Hermes High-Bandwidth Channel

MicroUnity's Hermes high-bandwidth channel architecture is designed to provide
ultra-high bandwidth communications between devices within MicroUnity's
Terpsichore system architecture.

Hermes-compliant devices include one or more byte-wide input and output
channels intended to operate at rates of at least 1 GHz. These channels provide a
packet communication link to general devices, processors, memories, and input-
output interfaces.

Hermes high-bandwidth channels employ nine signals, one clock signal and eight
data signals, using differential low-voltage levels for direct communication from
one device to another. The channels are designed to be arranged into a ring
consisting of up to four target devices and one initiator. The channels may also be
extended to permit multiple initiators in a single ring.

The Hermes interface protocol embeds read and write operations to a single
memory space into packets containing command, address, data, and
acknowledgement. The packets include check codes that will detect single-bit
transmission errors and multiple-bit errors with high probability. As many as eight
operations in each device may be in progress at a time. As many as four Hermes
devices may be cascaded to expand system capacity and bandwidth.

Hermes relies upon MicroUnity's Cerberus serial bus to provide access to a low-
level mechanism to detect skew in input channels, and to adjust skew in output
channels. This mechanism may be employed by software to adaptively adjust for
skew in the channels, or set to fixed patterns to account for fixed signal skew as
may arise in device-to-device wiring.

Architecture Framework 

The Hermes architecture defines a compatible framework for a family of
implementations with a range of capabilities. The following implementation-
defined parameters are used in the rest of the document in boldface. The values
indicated are for MicroUnity's first implementations.

Parameter   Interpretation                            Value   Range of legal values
A           log256 words in logical memory space,     4       1 ≤ A ≤ 8
            or size in bytes of a logical memory
            address
W           size in bytes of logical memory word      8       1 ≤ W ≤ 2^15, log2 W ∈ Z



Hermes devices have several optional capabilities, which are identified in the
following table:

Capability   Meaning
Master       Capable of generating requests on output channel and
             receiving responses on input channel
Slave        Capable of receiving requests on input channel and
             generating responses on output channel
Forwarding   Capable of forwarding requests and responses from input
             channel to output channel
Cache        Capable of storing values previously read or written and
             returning these values on subsequent reads



Electrical Signalling

Each Hermes channel consists of a one byte wide data path and a single-phase,
constant-rate clock signal. Both the data and clock signals are differential-pair
signals. The clock signal contains alternating zero and one values transmitted with
the same timing as the data signals; thus, the clock signal frequency is one-half the
channel data rate.

Each channel runs at a constant frequency and contains no auxiliary control,
handshaking, or flow-control information. The channel transmitter is responsible
for transmitting all nine differential-pair signals so as to be received with minimal
skew; the receiver is responsible for decoding the signals in the presence of noise
and skew as may arise due to differences in the signal environment of the clock
and of each data bit.



A Hermes device may be capable of responding to Hermes request packets
received on a Hermes input channel. Such a device is designated a slave device
and must operate the Hermes output channel at the same clock rate as the input
channel. A slave device must generate no more than a specified amount of
variation in the output clock phase, relative to the input clock, over changes in
system temperature or operating voltage.

A Hermes device that is capable of generating Hermes request packets is
designated a master device. A master device must be capable of generating the
constant-frequency clock signal on the Hermes output channel and accepting
signals on the Hermes input channel at the same clock frequency as is generated.
In addition, a master device must accept an arbitrary input clock phase, and must
accept a specified amount of variation in clock phase, as may arise due to changes
in system temperature or operating voltage.

Each Hermes input or output channel requires 18 pads, and the associated
Cerberus interface requires an additional 6 pads.

count     pad                               meaning
18        HiC, HiC_N, Hi7..0, Hi7..0_N      Hermes input channel
18        HoC, HoC_N, Ho7..0, Ho7..0_N      Hermes output channel
6         SC, SD, SN3..0                    Cerberus interface
36c + 6                                     total signal pads

Each Hermes input channel is terminated at a nominal 50 ohm impedance to
ground. Each Hermes output channel is optionally terminated at the same
impedance as the input channel. An adjustable termination impedance,
programmable via Cerberus, is recommended.

In order to provide for planar connections without vias among Hermes devices,
Hermes input channels and output channels must adhere to pad assignments
which preserve the relative ordering
of the conductors which connect the devices. In general, the ordering must be
consistent on circuit boards by which devices are interconnected. The ordering
fixes the order of Hermes pins encountered in a clockwise traversal of the pins to
HiC, HiC_N, Hi0, Hi0_N, Hi1, Hi1_N, Hi2, Hi2_N, Hi3, Hi3_N, Hi4, Hi4_N,
Hi5, Hi5_N, Hi6, Hi6_N, Hi7, Hi7_N, Ho7_N, Ho7, Ho6_N, Ho6, Ho5_N, Ho5,
Ho4_N, Ho4, Ho3_N, Ho3, Ho2_N, Ho2, Ho1_N, Ho1, Ho0_N, Ho0, HoC_N and
HoC, respectively. No other pins, except for low-bandwidth and power
connections which may contain vias, may be placed between these pins.

Hermes device dies (or the active die of a sandwich) are generally designed to be
placed on circuit boards face-down, so when viewed from the top of the circuit
board, this becomes the ordering:

[Figure: Hermes device interfaces, die face-down on circuit board; pins shown in the order Ho7, Ho7_N, Hi7_N, Hi7, ..., Hi0_N, Hi0, HiC_N, HiC]

Viewed from the top of the device die, this becomes the ordering:

[Figure: Hermes device interfaces, face of die]

Hermes devices may also be mounted in a face-up orientation; for example,
sandwiches with a feed-through die ("space transformer")
have the conductive-only die in a face-up orientation. For a die mounted in a face-
up orientation, the ordering of the pins of the die must be reversed, so when
viewed from the top of the circuit board, this requires the ordering:

[Figure: Hermes device interfaces, die face-up on circuit board]

The following is a diagram of the Hermes and Cerberus device interfaces, for a
device with a single pair of Hermes channel interfaces.

[Figure: Hermes device; caption: Hermes device interfaces]



Electrical characteristics

                                              MIN   TYP   MAX   UNIT   REF
VOH: H-state output voltage HoC, Ho7..0                          V      VDD
VOL: L-state output voltage HoC, Ho7..0                          V      VDD
VIH: H-state input voltage HiC, Hi7..0                           V      VDD
VIL: L-state input voltage HiC, Hi7..0                           V      VDD
IOH: H-state output current HoC, Ho7..0                          mA
IOL: L-state output current HoC, Ho7..0                          mA
IIH: H-state input current HiC, Hi7..0                           mA
IIL: L-state input current HiC, Hi7..0                           mA
CIN: Input capacitance HiC, Hi7..0                               pF
COUT: Output capacitance HoC, Ho7..0                             pF

Switching characteristics

tBC: HiC clock cycle time
tBCH: HiC clock high time
tBCL: HiC clock low time
tBT: HiC clock transition time
tBS: set-up time, Hi7..0 valid to HiC xition
tBH: hold time, HiC xition to Hi7..0 invalid
tOS: skew between HoC and Ho7..0

MIN:  1000  400  400  200  -200  -50
TYP:
MAX:  100  100  -100  50
UNIT: ps  ps  ps  ps  ps  ps  ps



Logical Memory Structure

Hermes defines a logical memory region as an array of 2^(8A) blocks of size W bytes.

[Figure: one block shown as bits 8W-1 down to 0; caption: Logical memory organization]



coLLff .H^^M:^^^^^^^^^ ^er.een .He 

Packet Structure

Hermes packets contain control commands, most commonly
read or write operations, with addresses and associated data. Other
commands indicate error conditions and responses to the above commands.

When no packets are being transmitted, such as during initialization and
between packets, an idle packet, consisting of a pair of an all-zero byte and all-one
byte, is transmitted through the channel. Each non-idle packet consists of two
or more bytes, and begins with a byte of value other than all
zeros. All packets begin during a clock period in which the clock signal is zero,
and all packets end during a clock period in which the clock signal is one.

The general form of a packet is an array of bytes, without a specified byte
ordering. The first byte contains a module address in the high-order two bits, a
packet identifier, usually a command, in the next three bits, and a link
identification number. The interpretation of the remaining bytes is dependent upon
the packet identifier:

  7  6 5   3 2   0
 | ma | com | lid |
 |     byte 1     |
 |     byte 2     |
 |      ...       |
 |     byte n     |
 |     check      |

General packet

The length of the packet is implied by the command specified by the initial byte of
the packet.

The check byte is computed as odd bit-wise parity, with a leftward circular
rotation after accumulating each byte. This algorithm provides detection of
single-bit and some multiple-bit errors with high probability (1-2^-8), but no
correction. As an example, the following packet has a proper check byte:

 7      0
 0x61
 0x00
 0x22
 0x11
 0x00
 0x86     check

The check byte in this example is calculated as:

binary     hex   notes
01100001   61    first byte
11000010   c2    shift left circular
00000000   00    second byte
11000010   c2    xor above two rows
10000101   85    shift left circular
00100010   22    third byte
10100111   a7    xor above two rows
01001111   4f    shift left circular
00010001   11    fourth byte
01011110   5e    xor above two rows
10111100   bc    shift left circular
00000000   00    fifth byte
10111100   bc    xor above two rows
01111001   79    shift left circular
10000110   86    sixth (check) byte
11111111   ff    xor above two rows
11111111   ff    shift left circular
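The worked calculation above can be expressed as a short routine. The following is a minimal sketch in Python, assuming (as the worked example suggests) that each byte is XORed into an 8-bit accumulator which is then rotated left by one bit, and that a proper packet leaves the accumulator at all ones. The function names are illustrative and do not come from the specification.

```python
def rotl8(value: int) -> int:
    """Rotate an 8-bit value left by one bit (circular shift)."""
    return ((value << 1) | (value >> 7)) & 0xFF


def accumulate(data: bytes) -> int:
    """XOR each byte into the accumulator, rotating left after each byte."""
    acc = 0
    for byte in data:
        acc = rotl8(acc ^ byte)
    return acc


def check_byte(body: bytes) -> int:
    """Check byte that makes the complete packet accumulate to 0xFF."""
    return accumulate(body) ^ 0xFF


def is_valid(packet: bytes) -> bool:
    """True if the packet, including its check byte, accumulates to 0xFF."""
    return accumulate(packet) == 0xFF


if __name__ == "__main__":
    body = bytes([0x61, 0x00, 0x22, 0x11, 0x00])
    print(hex(check_byte(body)))  # 0x86, matching the worked example
```

Under the same assumption, a single all-zero byte yields a check byte of 0xFF, which agrees with the zero-byte and all-ones-byte form of the idle packet.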



"^^^ general interpretation of the packet command is given in the following table 



value   interpretation     payload
0       idle               0
1       error              0
2       write-allocate     12
3       write-noallocate   12
4       read-allocate      4
5       read-noallocate    4
6       read-response      8
7       write-response     0
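The first-byte layout and the command table can be combined into a small header codec. This is a sketch only: it assumes the link identification number occupies the low-order three bits (consistent with its range of 0..7) and that the payload column counts bytes. The names `Header`, `pack_header` and `unpack_header` are hypothetical.

```python
from typing import NamedTuple

# Command names and payload sizes, transcribed from the command table.
# Payload units are assumed to be bytes.
COMMANDS = {
    0: ("idle", 0),
    1: ("error", 0),
    2: ("write-allocate", 12),
    3: ("write-noallocate", 12),
    4: ("read-allocate", 4),
    5: ("read-noallocate", 4),
    6: ("read-response", 8),
    7: ("write-response", 0),
}


class Header(NamedTuple):
    ma: int   # module address, high-order two bits
    com: int  # packet identifier (command), next three bits
    lid: int  # link identification number, assumed low-order three bits


def pack_header(ma: int, com: int, lid: int) -> int:
    """Build the first byte of a packet from its three fields."""
    if not (0 <= ma <= 3 and 0 <= com <= 7 and 0 <= lid <= 7):
        raise ValueError("field out of range")
    return (ma << 6) | (com << 3) | lid


def unpack_header(first_byte: int) -> Header:
    """Split the first byte of a packet into (ma, com, lid)."""
    return Header((first_byte >> 6) & 0x3, (first_byte >> 3) & 0x7, first_byte & 0x7)


if __name__ == "__main__":
    header = unpack_header(0x61)
    name, payload = COMMANDS[header.com]
    print(header, name, payload)
```

For the first byte 0x61 used in the check-byte example, this decodes to module address 1, command 4 (read-allocate) and link identification number 1.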



The module address field provides for as many as four Hermes slave devices to be
operated from a single channel. Module address values are assigned via either
static/geometric configuration pins (not recommended) or dynamically assigned
via a Cerberus configuration register.

The link identification field provides the opportunity for Hermes master devices to
initiate as many as eight independent operations at any one time to each Hermes
slave device. Each outstanding operation to a Hermes slave device must have a
distinct link identification number, and no ordering of operations is implied by the
value of the link identification field. There is no requirement for link identification
field values to be sequentially assigned in requests or responses.

The following section provides detailed descriptions of the structure of each type
of command packet.

Idle 

Idle packets fill the space between other packets with an alternating zero-byte and
all-ones-byte pattern. Idle packets may be dropped when received and
regenerated between outgoing packets. The idle packet is formatted as follows:

 7            0
 ma | com | lid
 check

Idle packet

The range of valid values and the interpretation of the fields is given by the
following table:

field   value   interpretation
ma      0       Module address field must be zero.
com     0       Packet is "idle."
lid     0       Link identification number field must be zero.
check   255     Check integrity of packet transmission.

No activity is performed upon receipt of a properly formatted idle packet.

Read Operation 

Read packets cause a Hermes device to perform a read operation for the specified
address, producing a data value. The value is read from cache, if one is present
and the address is present in the cache. If the address is not present in cache, the
value is read from memory. A value read is placed in the cache if the command is
"read-allocate"; if the command is "read-noallocate" the value is returned without
copying the value into the cache. The packet format is as follows:

 7            0
 ma | com | lid
 addr 7..0
 ...
 addr 8A-1..8A-8
 check

Read packet

The range of valid values and the interpretation of the fields is given by the
following table:

field   value         interpretation
ma      0..3          Module address.
com     4..5          Packet command is "read-allocate" or "read-noallocate."
lid     0..7          Respond with link identification number lid.
addr    0..2^(8A)-1   Logical memory block address as specified. The least
                      significant byte is sent first.
check   0..255        Check integrity of packet transmission.



If the fields are valid and the specified address is within the range of the memory,
a read-response packet is generated which contains the requested data value. The
"read-response" packet is formatted as follows:

 ma | com | lid
 data
 ...
 check

Read-response packet

The range of valid values and the interpretation of the fields is given by the
following table:

field   value         interpretation
ma      0..3          Module address ma as specified in read packet.
com     6             Packet command is "read response."
lid     0..7          Link identification number lid as specified in read packet.
data    0..2^(8W)-1   Data read from specified address.
check   0..255        Check integrity of packet transmission.



In order to reduce the latency of read response, Hermes devices may generate a
read response packet before checking redundant information that may alter the
response. If, after transmitting the data information, but before the last byte of
the packet is generated, the device detects that the data was transmitted in error,
the packet is "stomped," that is, marked as invalid, by ending the packet with a
check byte that is the ones-complement of the proper check byte. Such a packet
must be ignored by Hermes masters and may be either ignored or forwarded by
forwarding devices. If the redundant information indicates a correctable error, the
stomped packet is followed by a read response packet which contains the
corrected data.
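A receiver can tell a stomped packet from an ordinary transmission error by looking at the final check accumulator. The sketch below assumes the rotate-and-XOR accumulation shown in the check-byte example earlier in this section; under that assumption a proper packet ends at 0xFF, so a packet whose check byte has been ones-complemented ends at 0x00. The function name `classify` and its return strings are illustrative only.

```python
def rotl8(value: int) -> int:
    """Rotate an 8-bit value left by one bit (circular shift)."""
    return ((value << 1) | (value >> 7)) & 0xFF


def classify(packet: bytes) -> str:
    """Classify a complete packet (including its check byte).

    Returns "valid" for a proper check byte, "stomped" when the check byte
    is the ones-complement of the proper one, and "error" otherwise.
    """
    acc = 0
    for byte in packet:
        acc = rotl8(acc ^ byte)
    if acc == 0xFF:
        return "valid"
    if acc == 0x00:
        return "stomped"
    return "error"


if __name__ == "__main__":
    body = [0x61, 0x00, 0x22, 0x11, 0x00]
    print(classify(bytes(body + [0x86])))         # proper check byte
    print(classify(bytes(body + [0x86 ^ 0xFF])))  # ones-complemented check byte
```

Complementing the check byte flips every bit of the final XOR, and rotating an all-zero byte leaves it all zero, which is why the stomped case is distinguishable from most other corruptions.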
Write Operation

Write packets cause Hermes devices to perform a write operation, placing a data
value into the specified address. The value is written into cache, if one is present
and the address is present in the cache. If the address is not present in cache, and
the command is "write-allocate", the value is written into cache. If the address is
not present in cache, and the command is "write-noallocate", the value is written,
leaving the cache location unchanged. The packet format is as follows:

 ma | com | lid
 addr 7..0
 ...
 addr 8A-1..8A-8
 data
 ...
 check

Write packet

The range of valid values and the interpretation of the fields is given by the
following table:

field   value         interpretation
ma      0..3          Module address.
com     2..3          Packet command is "write-allocate" or "write-noallocate."
lid     0..7          Respond with link identification number lid.
addr    0..2^(8A)-1   Logical memory block address as specified. The least
                      significant byte is sent first.
data    0..2^(8W)-1   Data to be written at specified address.
check   0..255        Check integrity of packet transmission.

Write packet field interpretation

If the fields are valid and the specified address is within the range of the memory,
the memory is written and a write response packet is generated. The "write-
response" packet is formatted as follows:

 ma | com | lid
 check

Write response packet

The range of valid values and the interpretation of the fields is given by the
following table:

field   value    interpretation
ma      0..3     Module address ma as specified in write packet.
com     7        Packet command is "write response."
lid     0..7     Link identification number lid as specified in write packet.
check   0..255   Check integrity of packet transmission.



Error Handling 

The receipt of packets that do not conform to the requirements of this
specification over the input channel is an error, as are any conditions internal or
external to the device that prevent proper operation, such as uncorrectable
memory errors. The level or degree to which an implementation detects errors is
implementation-defined; to the extent possible, this architecture specification
recommends that all errors should be detected, but this is not strictly required. All
implementations must document the level of error detection, and all detected
errors must use the method described below for handling errors.

For Hermes devices, the following errors should be detected and the level of error
detection for each of these errors is required to be documented:

errors detected
invalid check byte
invalid command
invalid address
uncorrectable error in cache
uncorrectable error in device
invalid identification number
internal buffer overflow
invalid module address on idle packet
invalid identification number on idle/error packet
invalid check byte on idle packet



Packets received may have an invalid check byte, invalid command, invalid
address, invalid identification number, or in some implementations cause internal
buffer overflow. For each such error, the device generates a response explicitly
indicating such a condition (error response); the packet is otherwise ignored. Also,
an uncorrectable error in either the cache or the device results in the generation
of an error response.

The error response packet is formatted as follows: 

 7            0
 ma | com | lid
 check

Error response packet

The range of valid values and the interpretation of the fields is given by the
following table:

field   value    interpretation
ma      0..3     Module address identifying the source of the error response packet.
com     1        Packet is "error response."
lid     0        Link identification number must be zero.
check   0..255   Check integrity of packet transmission.

Error response packet field interpretation

Upon receipt of the error response packet, the packet originator must read the
Cerberus status register of the reporting device to determine the precise nature of
the error. Hermes devices reporting an invalid packet will suppress the receipt of
additional packets until the error is cleared, by clearing the status register.
However, such devices may continue to process packets which have already been
received, and generate responses. Upon taking appropriate corrective actions and
clearing the status register, the packet originator should re-send any
unacknowledged commands.

Because of the difference in speed between the high-bandwidth Hermes channel
and the Cerberus serial bus interface, it is generally safe to assume that an
attempt to read the status register via Cerberus will result in reading stable,
quiescent error conditions and that the queue of outstanding requests will have
drained. After clearing the status register, the packet originator may immediately
resume sending requests to the Hermes device.

Forwarding 

Hermes devices, whether master or slave, may have the capability to forward
packets intended for other devices connected to a Hermes channel. For slave
devices, forwarding is performed on the basis of the contents of the module
address field of the packet. Packets which contain a module address other than
that of the device are forwarded. All non-idle packets which contain such module
addresses must be forwarded, including error packets. For master devices,
packets not addressed to the device, as designated by the link identification field,
are forwarded.

To minimize ring latency it is generally desirable to forward these packets with
minimal latency. If a packet arrives at an input channel when the output channel
is in use, this latency must increase; at least a single-packet buffer is required.

The size of the forwarding buffer is implementation-dependent. Avoiding the
generation of an output packet if the forwarding buffer does not have room to hold
an additional input packet is required when the forwarding buffer is smaller than
the number of packets which may require forwarding (generally 24 packets).
However, this strategy may cause starvation, as output packets may be inhibited
indefinitely by a stream of input packets that require forwarding. Starvation may
be avoided by system-level design and configuration considerations beyond the
scope of this specification.

Packets which contain a check byte error may be forwarded; however, it is
recommended that such packets be transmitted with a check byte containing
more than one error bit, to minimize the possibility of an undetected second error.
Packets which contain a "stomped" check byte may be forwarded as is, or may be
ignored by a forwarding device. Note that when a packet is forwarded with
minimum latency, the output channel may begin transmitting a packet before the
input channel has received the entire packet: in such a case, the only available
choice is to continue forwarding the packet even if a check byte error or
stomped check byte is detected.
Ring Configurations



Hermes supports a variety of ring configurations. All devices in a cascade must
have the same values for A and W parameters, in order that each part may
properly interpret packet boundaries. The table below summarizes the
characteristics of the configurations available:

configuration                         masters               slaves
                                      number   forwarding   number   forwarding
master-slave pair                     1        no           1        no
single-master ring                    1        no           1-4      yes
dual-master pair                      2        no           0
multiple-master single-slave ring     1-8      yes          1        no
multiple-master multiple-slave ring   1-8      yes          1-4      yes

Hermes ring configurations



Master-slave Pair

The simplest ring consists of a single Hermes non-forwarding master device and a
single Hermes non-forwarding slave device. No forwarding is required for either
device as packets are sent directly to the recipient. The ring may have as many as
eight transactions outstanding, each containing distinct id field values.

Single-master Ring

A single-master ring may contain a cascade of up to four Hermes slave devices.
The cascade of devices will have the same or greater bandwidth as a single device,
but more latency. Each Hermes slave device must be configured to a distinct
module address, and each slave device must forward packets that contain module
address fields unequal to their own.




Packets are explicitly addressed to a particular Hermes device; any packet
received on a device's input channel which specifies another module address is
automatically passed on via its output channel. This mechanism provides for the
serial interconnection of Hermes devices into strings, which function identically to
a single device, except that a string has larger capacity and longer response
latency. Each slave device may have as many as eight transactions outstanding,
each containing distinct id field values.
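The slave-side forwarding rule can be illustrated with a small decision function. This is a sketch under stated assumptions: the module address is taken from the high-order two bits of the first byte and the command from the next three bits, idle packets (command 0) are dropped and regenerated rather than forwarded, and every other packet is forwarded unless its module address equals the slave's own. The name `slave_action` and its return strings are hypothetical.

```python
def slave_action(first_byte: int, own_ma: int) -> str:
    """Decide what a forwarding slave does with an incoming packet.

    first_byte: first byte of the packet (ma | com | lid).
    own_ma:     module address configured for this slave (0..3).
    """
    ma = (first_byte >> 6) & 0x3
    com = (first_byte >> 3) & 0x7
    if com == 0:
        # Idle packets may be dropped and regenerated between outgoing packets.
        return "drop"
    if ma == own_ma:
        # Packet is addressed to this device.
        return "process"
    # All non-idle packets for other module addresses must be forwarded.
    return "forward"


if __name__ == "__main__":
    print(slave_action(0x61, own_ma=1))  # addressed to this slave
    print(slave_action(0x61, own_ma=2))  # addressed to another slave
    print(slave_action(0x00, own_ma=0))  # idle packet
```

The sketch does not model buffer occupancy or error packets addressed to the slave itself; those cases depend on implementation details the text leaves open.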

Dual-master Pair

A dual-master pair consists of two master devices and no slave devices. Each
master device may initiate operations addressed to the other. No forwarding is
required for either device.

Hermes dual-master pair



Multiple-master Single-slave Ring

A multiple-master ring may contain multiple master devices and a single Hermes
slave device, provided that the master devices arrange to use different id values
for their requests. Each master may use a share of the eight transactions. Master
devices must forward packets not specifically addressed to them, as designated by
the values in the id field. The slave device need not forward packets, as all input
packets are addressed to it.

Hermes multiple-master single-slave ring



Multiple-master Multiple-slave Ring

A multiple-master ring may contain multiple master devices and as many as four
Hermes slave devices, provided that the master devices arrange to use different id
values for their requests. Each slave may have up to eight transactions
outstanding, and each master may use a share of those transactions. Master
devices must forward packets not specifically addressed to them, as designated by
the values in the id field. Slave devices must forward packets not specifically
addressed to them, as designated by the value of the ma field.

Hermes multiple-master multiple-slave ring



Response Packet Timing 

In general, a received packet which is interpreted as a command causes a
response packet to be generated. The latency between the end of the request
packet and the beginning of the response packet is affected by the processing and
forwarding of other packets, by the presence or absence of the requested word in
the cache, as well as implementation-dependent device parameters and
characteristics.

With full knowledge of the cache state, configurable parameters and
implementation-dependent characteristics, a Hermes master may completely
model the latency of responses. However, dependence on such characteristics is
not recommended, except for testing and characterization purposes.

A Hermes master must have the capability to detect a time-out condition, where a
response to a request packet is never received. The length of the time-out is
implementation-defined, and dependent upon the implementation of the Hermes
slave devices, so it is recommended that this time-out be long enough to
accommodate variation in the design of Hermes slave devices, or be configurable to
permit recovery in a minimum implementation-dependent delay.

Cerberus Registers

The Hermes channel architecture builds upon the Cerberus serial bus
architecture. Special requirements of Hermes-compliant devices are
described below.

Hermes requires that the values of A and log2W be made available in the high-
order byte of the first architecture description register as indicated below.

The format of the register is described in the table below. The octlet is the
Cerberus address of the register; bits indicate the position of the field in a register.
The value indicated is the hard-wired value in the register for a read/only register,
and is the value to which the register is initialized upon a reset for a read/write
register. If a reset does not initialize the field to a value, or if initialization is not
required by this specification, a * is placed in or appended to the value field. The
range is the set of legal values to which a read/write register may be set. The
interpretation is a brief description of the meaning or utility of the register field; a
more comprehensive description follows this table.

octlet   bits     field name      value   range   interpretation
4        63..60   A               4       1..15   size of a Hermes address
4        59..56   log2W           3       0..15   size of a Hermes word
4        55..0    not specified                   not specified by Hermes architecture
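As an illustration of the register layout, the following sketch extracts A and W from a 64-bit value read from octlet 4. It assumes that A counts address bytes and that the word size W is 2 raised to the log2W field, in bytes; the function name `decode_arch_register` is hypothetical and not part of the Cerberus or Hermes specifications.

```python
from typing import Tuple


def decode_arch_register(octlet4: int) -> Tuple[int, int]:
    """Return (A, W) from the first architecture description register.

    Bits 63..60 hold A; bits 59..56 hold log2W. The remaining bits are
    not specified by the Hermes architecture and are ignored here.
    """
    a = (octlet4 >> 60) & 0xF
    log2w = (octlet4 >> 56) & 0xF
    return a, 1 << log2w


if __name__ == "__main__":
    # High-order byte 0x43 corresponds to the values listed in the table.
    a, w = decode_arch_register(0x43 << 56)
    print(a, w)
    # Under these assumptions a read packet carries A address bytes and a
    # write packet carries A + W bytes of address and data.
    print(a, a + w)
```

With the table's values (A = 4, log2W = 3), this gives 4 address bytes and 8 data bytes, which lines up with the payload sizes of 4, 8 and 12 listed for the read, read-response and write commands.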



Architecture Description Registers

The architecture description registers in octlets 4 and 5 comply with the Cerberus
specification and contain a machine-readable version of the architecture
parameters: A, W described in this document.
Cerberus Serial Bus



MicroUnity's Cerberus serial bus architecture is designed to provide bootstrap
resources, configuration and diagnostic support for MicroUnity's Terpsichore
system architecture.

The Cerberus serial bus employs two signals, both at TTL levels, for direct
communication among as many as 2^8 devices. One signal is a continuously running
clock and the other is an open-collector bidirectional data signal. Four additional
signals provide a geographic 8-bit address for each device. A gateway protocol and
optional configurable addressing each provide a means to extend Cerberus to as
many as 2^16 buses and 2^24 devices.

The protocol is designed for universal application among the custom chips used to
implement the Terpsichore system architecture. It is also designed to be
compatible with implementations embodied in FPGA parts, such as those made
by Xilinx, Altera, Actel and others. Such FPGA parts may be used to adapt the
Cerberus protocol in a minimum of logic to attach small serial bus devices, such as
those made by Dallas Semiconductor (EEPROM, serial number parts), ITT (IM
bus), Signetics (I2C bus). It is also a goal that such FPGA parts can be used to
adapt the Cerberus protocol to communication over EIA-232/422/423/485 links to
existing systems for the purposes of system development, manufacturing test and
configuration, and manufacturing rework.

The Cerberus serial bus is used for the initial bootstrap program load of 
Terpsichore; the bootstrap ROM connects to Terpsichore via Cerberus. Because 
the Cerberus must be operational for the fetch of the first instruction of 
Terpsichore, the bus protocol has been devised so that no transactions are 
required for initial bus configuration or bus address assignment. 

Electrical Signalling 

The diagram below shows the signals used in Cerberus. 



Device: SN3..0, SC, SD

Cerberus Signals
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The SC signal is a continuously running clock signal at TTL levels The raie i<: 
stotlim'orT' -a.n,u. 0 (DC, .ini.um. the SC I^nal U s^urled f^^ 
unfpecK • ^ ■ '^'°"=*^ ' °f ^^^l^ich is 

InL^'Z^ 1 "'""^ ' ^"""""'^ ^^"g^h °f bus and quality of the 

noise and signal termination environment. The amount of skew in the SC s enal 

J?gh)'Ss!o^a; ^:^':^ts:'^itn - ' = = 

devices on Cerberus. ' ''^ communicauon among 

Sbt lSril'Jr'^'"?"°" ""'T'^' "1^ "^^^ °" ^'g"'^- depending upon 
Xmes emn ovfr ""^^""^ -^^u'^ O"'^ °f ^^c simplest 

SovT Vcc aT„ ""T P""""P °* ^'^^ equivalent of 220 Ohms to 3.3 Volts 
The following table specifies parameters that must be met by Cerberus-compliant
devices. Voltages are referenced to Vss.



Recommended operating conditions          MIN    NOM   MAX        UNIT
Operating free-air temperature            0            70         C

Electrical characteristics                MIN    TYP   MAX        UNIT
Vol: L-state output voltage               0            0.5        V
Vih: H-state input voltage SD             2.0          VDD+0.5    V
Vih: H-state input voltage SC, SN3..0     2.0          5.5 (53)   V
Vil: L-state input voltage                -0.5         0.8        V
Iol: L-state output current (54)                       16         mA
Ioz: Off-state output current (55)        -10          10         µA
Cout: Output capacitance                               4.0        pF

Switching characteristics                 MIN    TYP   MAX        UNIT
tC: SC clock cycle time                   50                      ns
tCH: SC clock high time                   20                      ns
tCL: SC clock low time                    20                      ns
tT: SC clock transition time                           5          ns
tS: set-up time, SD valid to SC rise      0                       ns
tH: hold time, SC rise to SD invalid                   1          ns
tOD: SC rise to SD valid                  5                       ns



Geographic addressing

The objective of the geographic addressing method in Cerberus is to ensure that
each device is addressable with a number which is unique among all devices on
the bus and which reflects the physical location of the device, so that the address
remains the same each time the system is operated.

When a system requires at most 16 devices, the geographic addressing method
permits the assignment of addresses 0 through 15 by directly wiring the low-order
4 bits of the address in binary code using input signals SN3..0. For these purposes,
wiring to a logic high (H) level supplies a value of 1, and wiring to Vss or logic low
(L) level supplies a value of 0.



53 Cerberus recommends, but does not require, that compliant devices be able to sustain
input levels provided by 5V TTL-compatible devices on the SC and SN3..0 inputs.

54 Devices which fail to comply with the low-state output current specification may operate
with Cerberus-compliant devices, but may require changes to the termination network. System
designers should evaluate the effect that limited drive current will have on the worst-case
Low-state signal level.

55 Devices which fail to comply with the off-state output current specification may operate
with Cerberus-compliant devices, but may limit the number of devices which may co-exist on a
single Cerberus bus. System designers should evaluate the effect that additional leakage
current will have on the worst-case High-state and Low-state signal levels.

The following table indicates the wiring pattern for each device address from 0
through 15:

Device address   Binary code   SN3   SN2   SN1   SN0
0                00000000      L     L     L     L
1                00000001      L     L     L     H
2                00000010      L     L     H     L
3                00000011      L     L     H     H
4                00000100      L     H     L     L
5                00000101      L     H     L     H
6                00000110      L     H     H     L
7                00000111      L     H     H     H
8                00001000      H     L     L     L
9                00001001      H     L     L     H
10               00001010      H     L     H     L
11               00001011      H     L     H     H
12               00001100      H     H     L     L
13               00001101      H     H     L     H
14               00001110      H     H     H     L
15               00001111      H     H     H     H



An extension of this method is used for the assignment of addresses 0 through 255
when a system requires more than 16 devices, up to 2^8 devices. Additional code
combinations are made available by wiring each of the same input signals SN3..0 to
one of four signals: the two logic levels L and H, and two additional signals, the SC
signal and an inverted copy of the SC signal, SC_N. Because each of the four input
signals is wired to one of four values, 4^4=2^8=256 combinations are possible.

The wiring pattern is constructed using the algorithm: If the desired device
address is the value N, for each input signal SNx, where x is in the range 3..0, wire
SNx to one of the four signals L, H, SC, or SC_N, according to the following table,
depending on the value of bit 4+x and bit x of N.

N4+x   Nx   SNx
0      0    L
0      1    H
1      0    SC
1      1    SC_N

The table below indicates the wiring pattern for some device addresses.

Device address   Binary code   SN3    SN2    SN1    SN0
16               00010000      L      L      L      SC
17               00010001      L      L      L      SC_N
18               00010010      L      L      H      SC
19               00010011      L      L      H      SC_N
...
29               00011101      H      H      L      SC_N
30               00011110      H      H      H      SC
31               00011111      H      H      H      SC_N
32               00100000      L      L      SC     L
33               00100001      L      L      SC     H
34               00100010      L      L      SC_N   L
...
254              11111110      SC_N   SC_N   SC_N   SC
255              11111111      SC_N   SC_N   SC_N   SC_N
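The wiring algorithm is mechanical enough to express directly in code. The sketch below maps a device address N to the wiring of SN3..SN0 using bit 4+x and bit x of N, and also provides the inverse mapping as an illustration. The function names are hypothetical, and the inverse is simply the table read backwards, not a description of the decoding logic used in hardware.

```python
from typing import List, Sequence

# (bit 4+x of N, bit x of N) -> signal wired to SNx
WIRING = {(0, 0): "L", (0, 1): "H", (1, 0): "SC", (1, 1): "SC_N"}


def wiring_for_address(n: int) -> List[str]:
    """Return the wiring of [SN3, SN2, SN1, SN0] for device address n."""
    if not 0 <= n <= 255:
        raise ValueError("device address must be in 0..255")
    return [WIRING[((n >> (4 + x)) & 1, (n >> x) & 1)] for x in (3, 2, 1, 0)]


def address_for_wiring(wires: Sequence[str]) -> int:
    """Inverse mapping: recover the device address from [SN3, SN2, SN1, SN0]."""
    inverse = {signal: bits for bits, signal in WIRING.items()}
    n = 0
    for x, wire in zip((3, 2, 1, 0), wires):
        upper, lower = inverse[wire]
        n |= (upper << (4 + x)) | (lower << x)
    return n


if __name__ == "__main__":
    for address in (5, 18, 34, 254):
        print(address, wiring_for_address(address))
```

Addresses 0 through 15 use only L and H, so the extended scheme reduces to the direct binary wiring described for systems with at most 16 devices.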



The diagram below shows the waveform of the SC signal and the four signals that
each of the SN3..0 inputs may be wired to.

The values shown in the diagram above are decoded using four copies of the
following logic, one for each value of x in the range 3..0:

The NU and NL values are combined together in the order:

NU3 | NU2 | NU1 | NU0 | NL3 | NL2 | NL1 | NL0

to construct an 8-bit device number by which operations are addressed.

Bit-Level Protocol

The communication protocol rests upon a basic mechanism by which any device
may transmit one bit of information on the bus, which is received by all devices on
the bus at once. Implicit in this mechanism is the resolution of collisions between
devices which may transmit at the same time.

Each transmitted bit begins at the rising edge of the SC signal, and ends at the
next rising edge. The bit value is sampled by all devices at the next rising edge of
the SC signal, thus permitting relatively large signal settling time on the SD signal,
provided that skew on the SC signal is adequately controlled.

The transmission of a zero (0) bit value on the bus is performed by the transmitter
driving the SD signal to a logical-low value. The transmission of a one (1) bit value
on the bus is performed by the transmitter releasing the SD signal to attain a
logical-high value (driven by the signal termination network). If more than one
device attempts to transmit a value on the same clock period (of the SC signal), the
resulting value is a zero if any device transmits a zero value, and is a one if all
devices transmit a one value. We define the occurrence of one or more devices
transmitting a zero value on the same clock cycle where one or more devices
transmit a one value as a collision.

Because of this wired-and collision mechanism, if a device transmits a zero value,
it cannot determine whether any other devices are transmitting at the same time.
If a device transmits a one value, it can monitor the resulting value on the SD
signal to determine whether any other device is transmitting a zero value on the
same clock cycle. In either case, if two or more devices transmit the same value on
the same clock cycle, none of the devices can detect the other's transmission.

This collision mechanism carries over to the design of the protocol, where if two
devices transmit the same values on each cycle of a transaction, no collision
occurs. The protocol is designed so that the transaction occurs normally. These
transactions may occur frequently if two identical devices are reset at the same
time and each begins the same transactions, such as two processors each fetching
bootstrap code from a single shared ROM device.
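The wired-and behavior of the SD signal can be modeled in a few lines. This is a minimal sketch assuming each device either drives SD low (bit 0) or releases it (bit 1) during a clock cycle; the function names are illustrative and not taken from the specification.

```python
from typing import List, Sequence


def bus_value(transmitted: Sequence[int]) -> int:
    """Wired-and result: 0 if any device drives a zero, otherwise 1."""
    return int(all(transmitted))


def is_collision(transmitted: Sequence[int]) -> bool:
    """A collision is at least one zero and at least one one in the same cycle."""
    return 0 in transmitted and 1 in transmitted


def devices_noticing(transmitted: Sequence[int]) -> List[int]:
    """Indices of devices that can observe the collision.

    Only a device that released SD (sent a one) and then samples a zero
    can tell that another device transmitted at the same time.
    """
    bus = bus_value(transmitted)
    return [i for i, bit in enumerate(transmitted) if bit == 1 and bus == 0]


if __name__ == "__main__":
    cycle = [1, 0, 1]
    print(bus_value(cycle), is_collision(cycle), devices_noticing(cycle))
```

Devices that all send the same value see exactly what they sent, which is why identical simultaneous transactions proceed without any device detecting the other.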

Packet Protocol

The packet protocol uses the bit-level mechanism to transmit information on the
bus in units of eight bits or multiples of eight bits, while resolving potential
collisions between devices which may simultaneously begin transmitting a packet.
It provides for the detection of single-bit transmission errors, and for controlling
the rate of information flow with eight-bit granularity. The protocol also provides
for the transmission of a system-level reset.

Each packet trarjsmission begins with a single start bit. in which SD alwavs has a 

taninf ^tE th^ r- '^i^ °i ^^^^ ^'"^ ^^'^ ^"-I^^ tran mk"d 

bhTirsmiJtei r f?^"*^""^ transmitting the eight data bits, a pantv 

bit IS transmuted. If transmission continues with additional data, a single one 

Packet transmission
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Ochennse. on the cycle following the transmission of each parity bit. anv device 
nm- demand an additional delay ot two cycles co process the data bv drinnT h" 

other tv^.'^F ^ ''^?t'''' ^"^-^^ ^ value) bv anJ 

SD Siena h. ""*"" '^f's '''''^'^^f ^^P"""S ^^^^ P^«"" of driving the 
i>D signal (to a zero value) for one cycle and releasing the SD signal (to a one 
vdue) tor one cycle and ensuring that the signal has been released Addi ional 
"TT'i immediately after the bus has been one (released) on the 
d (delay) clock cycJe. without additional stan bits, as shovvn in the figure beU 



nantw deiav ahrrt ' / l-.^— lliJL— 



I panty deiay aCcn aetay abon aetay 

delay between bytes 



Any device may abort a transaction, usually because of a detected parity error,
by driving the SD signal (to a zero value) on the cycle following the delay cycle,
and continuing to drive it for the next ten cycles. The additional cycles ensure
that the abort is detected by all devices even under the adverse condition where a
single-bit transmission error has placed devices into inconsistent states. Each
device that detects an abort drives the SD signal (to a zero value) for ten cycles
after the abort cycle state, so in the most adverse case an abort may hold the bus
for additional cycles. The figure below shows a typical abort followed by
immediate re-transmission of the transaction.



[Figure: Abort followed by re-transmission (SD waveform, with the parity, delay, abort, reset and start cycles marked)]



A device may reset all Cerberus devices by driving the SD signal (to a zero
value) for at least 33 cycles. This is sufficient to ensure that all devices
receive the reset no matter what state the device is in prior to the reset.
Transmission may resume after the SD signal is released (to a one value) for
two cycles, as shown in the figure below.



[Figure: Reset followed by transmission (SC and SD waveforms, with the reset and start cycles marked)]
The state diagram below describes this protocol in further detail:




Serial transmission state diagram. 



The table below describes the data output and actions which take place at each
state in the above diagram. The next state for each state in the table is either
column go-0 or go-1, depending on the value of the in column.



| State | out | in  | go-0 | go-1 | action |
|-------|-----|-----|------|------|--------|
| s     | s   | is  | 0    | s    | s = 0 iff transmit first byte. Must wait in this state one cycle (with s = 1) if transmitting a new transaction. |
| 0     | d0  | i0  | 1    | 1    | bit 0 (LSB) of data. If d0 & ~i0, lose arbitration. |
| 1     | d1  | i1  | 2    | 2    | bit 1 of data. If d1 & ~i1, lose arbitration. |
| 2     | d2  | i2  | 3    | 3    | bit 2 of data. If d2 & ~i2, lose arbitration. |
| 3     | d3  | i3  | 4    | 4    | bit 3 of data. If d3 & ~i3, lose arbitration. |
| 4     | d4  | i4  | 5    | 5    | bit 4 of data. If d4 & ~i4, lose arbitration. |
| 5     | d5  | i5  | 6    | 6    | bit 5 of data. If d5 & ~i5, lose arbitration. |
| 6     | d6  | i6  | 7    | 7    | bit 6 of data. If d6 & ~i6, lose arbitration. |
| 7     | d7  | i7  | p    | p    | bit 7 (MSB) of data. If d7 & ~i7, lose arbitration. |
| p     | p   | ip  | d    | d    | p = ~^d7..0 (odd parity); abort if p ≠ ip. |
| d     | d   | id  | a    | s/0  | d = 0 iff transmit delay, abort, or reset. If id = 1, go to state 0 if not last byte of packet; else state s. |
| a     | a   | ia  | g    | d    | a = 0 iff transmit abort or reset. If ia = 0, abort transaction. |
| g     | 0   | N/A | g/r  | N/A  | stay in state g 10 times, then go to state r. |
| r     | r   | ir  | r    | s    | r = 0 iff transmit reset. If ir = 0 and have been in this state 12 times, reset device. |
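The per-byte framing in the table can be illustrated with a minimal Python sketch. It assumes the start bit is zero (driven), data bits go out least-significant bit first, and the parity bit is the complement of the XOR of the data bits. The function names are hypothetical, and the delay, abort and reset states are left out.

```python
def odd_parity(byte: int) -> int:
    """Parity bit that makes the total number of ones (data + parity) odd."""
    ones = bin(byte & 0xFF).count("1")
    return 1 - (ones & 1)


def frame_byte(byte: int, first: bool = True) -> list[int]:
    """Bits put on SD for one byte: optional start bit, d0..d7, then parity."""
    bits = [0] if first else []                    # start bit is zero (driven)
    bits += [(byte >> i) & 1 for i in range(8)]    # bit 0 (LSB) first
    bits.append(odd_parity(byte))
    return bits


def wired_and(*levels: int) -> int:
    """Level seen on a wired-and line: any device driving zero pulls it low."""
    return int(all(levels))


def lost_arbitration(sent: int, seen: int) -> bool:
    """The table's 'd & ~i' test: released a one but observed a zero."""
    return sent == 1 and seen == 0
```

For example, `frame_byte(0xA5)` yields the start bit, the eight data bits of 0xA5 in LSB-first order, and a parity bit of one, since 0xA5 already contains an even number of ones.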



In order to avoid collisions, no device is permitted to start the transmission of a
packet unless no current transaction is underway. To resolve collisions that may
occur if two devices begin transmission on the same cycle, each transmitting
device must monitor the bus during the transmission of one (released) bits. If any
of the bits of the byte are received as zero (driven) when transmitting a one
(released), the device has lost arbitration and must transmit no additional bits of
the current byte or transaction.



All other devices must wait one additional SC (clock) cycle [...]

[Figure: Re-transmission after loss of arbitration (SD waveform, with the parity, delay and start cycles marked)]



A time-out limit of no more than 256 idle clock cycles is permitted between the
packets of a transaction. After seeing this many idle clock cycles, initiator-capable
devices abort the transaction by transmitting a time-out packet, which consists of
two bytes of zeroes.

Slow devices may require more cycles between the transmission of packets in a
transaction than are permitted as idle clock cycles. Such devices may avoid the
time-out limit by delaying the completion of the transmission of the previous
packet until the idle time is less than the time-out limit, as shown in the figure
below. In this way, devices of any speed may be accommodated.



[Figure: Delaying completion of previous packet until ready to respond (SC and SD waveforms, with the parity, delay, abort and start cycles marked)]



Initiator-capable and other devices cooperatively avoid collisions between the
time-out packet and transaction responses. The responsibility of initiator-capable
devices is to inhibit transmission of a time-out packet if, before the packet can be
transmitted, some device begins transmitting, even if such a transmission begins
after 256 idle clock cycles have elapsed. If the design of a target device ensures
that no more than 256 idle clock cycles elapse between packets of a transaction, it
need not be concerned with the possibility of a collision during the transmission
of a response packet. Otherwise the responsibility of the target devices is to
inhibit transmission of a response if some other device begins transmitting a
time-out packet after at least 256 idle clock cycles have elapsed.

A device which requires delay after an aborted transaction or a reset may cause
such a delay by forcing the delay bit after the first byte of the immediately
following transaction as required. If, in such a case, the device cannot keep a copy
of the first byte of the transaction, it may force the transaction initiator to
retransmit the byte by aborting that following transaction after a suitable delay
has been requested.

Transaction Protocol

A transaction consists of the transmission of a series of packets. The transaction
begins with a transmission by the transaction initiator, which specifies the target
net, device, length, type, and payload of the transaction request. If the type of the
packet is in the range 128..255, the target device responds with an additional
packet, which contains a length and type code and payload. The transaction
terminates with a packet with a type field in the range 0..127; otherwise the
transaction continues with packet transmission alternating between transaction
initiator and the specified target.

The general form of an initial packet is:

| n0 | n1 | de | L | T | P0 | P1 | ... | PL-1 |

The general form of subsequent packets is:

| L | T | P0 | P1 | ... | PL-1 |
The range of valid values and the interpretation of the bytes is given by the
following table:

| Field            | Value  | Interpretation |
|------------------|--------|----------------|
| n0, n1           |        | network address of target, relative to network address of transaction initiator. Value is zero (0) if target is on same bus as transaction initiator. |
| de               | 0..255 | device address; in this case, an absolute value, i.e., not relative to device address of transaction initiator. |
| L                | 0..255 | payload length, or number of bytes after transaction code (T). |
| T                | 0..255 | transaction code. If the transaction code is in the range of 0..127, the transaction is terminated with this packet. If the transaction code is in the range of 128..255, the transaction continues with additional packets. |
| P0, P1, ... PL-1 | 0..255 | Payload of transaction. |
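As an illustration of the two packet forms, the following Python sketch assembles the byte sequences described above. The function names are hypothetical; the network bytes n0 and n1 are passed through unchanged because their encoding is not restated here, and start, parity and delay bits are not modelled.

```python
def initial_packet(n0: int, n1: int, de: int, t: int, payload: bytes = b"") -> bytes:
    """Bytes of an initial packet: n0, n1, de, L, T, then L payload bytes."""
    if len(payload) > 255:
        raise ValueError("payload length L must fit in one byte (0..255)")
    return bytes([n0, n1, de, len(payload), t]) + bytes(payload)


def subsequent_packet(t: int, payload: bytes = b"") -> bytes:
    """Bytes of a subsequent packet: L, T, then L payload bytes."""
    if len(payload) > 255:
        raise ValueError("payload length L must fit in one byte (0..255)")
    return bytes([len(payload), t]) + bytes(payload)
```

In both forms L counts only the bytes that follow T, so an empty payload gives L = 0.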



The valid transaction codes are given by the following table:

| mnemonic | L  | T        | interpretation |
|----------|----|----------|----------------|
| te       | 0  | 0        | transaction error: bus timeout, invalid transaction code, invalid address |
| tc       | 0  | 1        | transaction complete: normal response to a write operation |
| d8       | 8  | 2        | data returned from read octlet |
|          |    | 3..127   | reserved for future definition |
| w8       | 10 | 128      | write octlet |
| r8       | 2  | 129      | read octlet |
|          |    | 130..255 | reserved for future definition |



All Cerberus devices must support the transaction codes: te, tc, d8, w8, and r8.
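The code table can be captured in a small lookup, sketched here in Python. The dictionary and helper names are hypothetical; the (L, T) values are taken from the table above.

```python
# (L, T) pairs for the five mandatory transaction codes, from the table above.
TRANSACTION_CODES = {
    "te": (0, 0),     # transaction error
    "tc": (0, 1),     # transaction complete
    "d8": (8, 2),     # data returned from read octlet
    "w8": (10, 128),  # write octlet
    "r8": (2, 129),   # read octlet
}


def expects_further_packet(t: int) -> bool:
    """Codes 128..255 (high-order bit set) are followed by another packet."""
    return bool(t & 0x80)


def mnemonic(length: int, t: int):
    """Return the mnemonic for a defined (L, T) pair, or None if reserved."""
    for name, pair in TRANSACTION_CODES.items():
        if pair == (length, t):
            return name
    return None
```

Testing only the high-order bit of T matches the rule that bus monitors need not understand the reserved codes.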

All Cerberus devices monitor SD to determine when transactions begin and end.
A transaction is terminated by the completion of the transmission of the specified
number of payload bytes in a transaction with code in the range 0..127, or by the
transmission of an abort sequence. For purposes of monitoring transaction
boundaries, only the L byte is interpreted: the value of the T byte (except for the
high order bit) must be disregarded. This is of particular importance as many
transaction codes are reserved for future definition, and the use of such
transaction codes between devices which support them must be permitted even
though other devices on the Cerberus bus may not be aware of the meaning of
such transactions. A Cerberus device must permit any value in the L byte for
transactions addressed to other devices, even if only a limited set of values is
permitted for transactions addressed to that device.

Transactions addressed to a device which does not provide support for the
transaction should be aborted by the addressed target device.

The selection of the payload length L and transaction code I for the transaction 
nXf U 01. particular note. Because the value of all information b ts of the 

?otSt;^s;vsxi^ - ^•^^^•^ — ^^^^ p-^« ^^-^ 

The "write octlet" transaction causes eight bytes of data to be transferred from
the transaction initiator to the addressed target device at an octlet-aligned 16-bit
device address. The transaction begins with a request packet of the form:

| n0 | n1 | de | 10 | w8 | A0 | A1 | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 |

The normal response to this request is of the form:

| 0 | tc |

The error response to this request is of the form:

| 0 | te |

The 16-bit device address is interpreted as an octlet address (not a byte address)
and is assembled from the A0 and A1 bytes as (most significant byte is transmitted
first):

 15    8 7     0
|  A0   |  A1   |
    8       8

The data to be transferred to the target device is assembled into an octlet as (most
significant byte is transmitted first):

 63  56 55  48 47  40 39  32 31  24 23  16 15   8 7    0
|  D0  |  D1  |  D2  |  D3  |  D4  |  D5  |  D6  |  D7  |
   8      8      8      8      8      8      8      8
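A write octlet request can be assembled as in the following Python sketch, which places the 16-bit octlet address and the 64-bit data most significant byte first. The function name and its parameters are hypothetical, and the bit-level framing of each byte is not shown.

```python
W8 = 128  # transaction code for "write octlet"


def write_octlet_request(n0: int, n1: int, de: int,
                         octlet_address: int, data: int) -> bytes:
    """Bytes n0, n1, de, L=10, w8, A0, A1, D0..D7 of a write octlet request."""
    if not 0 <= octlet_address < 1 << 16:
        raise ValueError("octlet address must fit in 16 bits")
    if not 0 <= data < 1 << 64:
        raise ValueError("data must fit in 64 bits")
    payload = octlet_address.to_bytes(2, "big") + data.to_bytes(8, "big")
    return bytes([n0, n1, de, len(payload), W8]) + payload
```

The payload is ten bytes long (two address bytes plus eight data bytes), which matches the L value of 10 listed for w8.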

Side-effects due to the alteration of the contents of the octlet at the specified
address are only permitted if the transaction completes normally. In the event that
the write octlet transaction is aborted at or prior to the transmission of the A1
byte, the target device must make no permanent state changes. If the transaction
[...] the specified address is undefined. If alterations of the contents normally would
cause side-effects on the operation of the Cerberus device or side-effects on the
contents of other addressable octlets in the device, these side-effects must be
suppressed.

If the addressed target device is not present on the Cerberus bus, the transaction
will proceed to the point of transmitting the octlet data and then stop until the idle
time-out limit is reached. At that point, one or more initiator-capable devices will
generate an error response packet.

If the addressed target device is present on the Cerberus bus, but the 16-bit [...]

The "read octlet" transaction causes eight bytes of data to be transferred to the
transaction initiator from the addressed target device at an octlet-aligned 16-bit
device address. The transaction begins with a request packet of the form:

| n0 | n1 | de | 2 | r8 | A0 | A1 |

The normal response to this request is of the form:

| 8 | d8 | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 |

The error response to this request is of the form:

| 0 | te |

The 16-bit device address is interpreted as an octlet address (not a byte address)
and is assembled from the A0 and A1 bytes as (most significant byte is transmitted
first):

 15    8 7     0
|  A0   |  A1   |
    8       8

The data to be transferred from the target device is assembled into an octlet as (most
significant byte is transmitted first):

 63  56 55  48 47  40 39  32 31  24 23  16 15   8 7    0
|  D0  |  D1  |  D2  |  D3  |  D4  |  D5  |  D6  |  D7  |
   8      8      8      8      8      8      8      8

Regardless of whether the transaction completes, the read octlet transaction must 
have no side-effects on the operation of the Cerberus device or the contents of 
other addressable octlets. 
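The read octlet exchange can be sketched in Python as a request builder and a response parser. The names `read_octlet_request`, `parse_read_response` and `TransactionError` are hypothetical, and only the two response forms shown above are handled.

```python
R8, D8, TE = 129, 2, 0  # transaction codes used by a read octlet exchange


class TransactionError(Exception):
    """Raised when the response is the | 0 | te | error packet."""


def read_octlet_request(n0: int, n1: int, de: int, octlet_address: int) -> bytes:
    """Bytes n0, n1, de, L=2, r8, A0, A1 of a read octlet request."""
    return bytes([n0, n1, de, 2, R8]) + octlet_address.to_bytes(2, "big")


def parse_read_response(packet: bytes) -> int:
    """Return the 64-bit octlet from | 8 | d8 | D0..D7 |, or raise on | 0 | te |."""
    length, code = packet[0], packet[1]
    if (length, code) == (8, D8) and len(packet) == 10:
        return int.from_bytes(packet[2:], "big")  # D0 is the most significant byte
    if (length, code) == (0, TE):
        raise TransactionError("target or time-out generator reported an error")
    raise ValueError("unexpected response to read octlet")
```

A caller would send the request bytes, collect the response packet, and pass it to `parse_read_response` to obtain the octlet value.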
If the addressed target device is not present on the Cerberus bus, the transaction
will proceed to the point of transmitting the octlet address and then stop until the
idle time-out limit is reached. At that point, one or more initiator-capable devices
will generate an error response packet.

If the addressed target device is present on the Cerberus bus, but the 16-bit
[...] response packet.

Dedicated Octlets

Octlet addresses are assigned by which all Cerberus devices may be
identified as to device type, manufacturer, revision, and by which devices may be
[...]

| octlet    | contents (bits 63..0) |
|-----------|-----------------------|
| 0         | identify architecture |
| 1         | identify implementation |
| 2         | identify manufacturer |
| 3         | identify serial number |
| 4..5      | identify architectural features and options |
| 6         | specify operating modes |
| 7         | report operating status |
| 8..2^16-1 | not specified by Cerberus |



The octlets at addresses 0 through 3 identify the company which specifies the
device architecture (e.g. MicroUnity), the device architecture (e.g. Mnemosyne,
Terpsichore, Calliope), the implementor (e.g. MicroUnity, partner), the device
implementation and manufacturer and manufacturing version (e.g. 1.0, 1.1),
and optionally a unique device serial number. Addresses 0 through 2 are
read/only: an attempt to write to these addresses may cause either a normal
termination or an error response. Address 3 may be read/only or read/write.
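| octlet | bits 63..16 (48 bits) | bits 15..0 (16 bits)  |
|--------|-----------------------|-----------------------|
| 0      | architecture code     | architecture revision |
| 1      | implementor code      | implementor revision  |
| 2      | manufacturer code     | manufacturer revision |
| 3      | serial number         | configurable address  |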
The octlet at address 0 contains an architecture code and revision identifier. The
architecture code and revision identifies each distinctly designed architecture
version of a device. Normally, a change in the upper byte of the revision indicates
a change in which features may have changed. A change in the lower byte of the
revision signifies a change made to repair design defects or upward-compatible
revisions.

The architecture code is a unique 48-bit identifier, comprised of the concatenation
of a 24-bit unique company identifier[56] and a 24-bit value specified by the
designated company. This code must not duplicate 48-bit identifiers specified for
this purpose, or for other purposes, including use of unique identifiers for
implementation codes, manufacturing codes, or in IEEE 1212 or IEEE 802. IEEE
802 48-bit identifiers are specified in terms of a binary ordering of bits on a single
line; for Cerberus, the ordering which is appropriate is that labelled "CSMA/CD
and Token Bus," where bits are driven onto Cerberus with the least-significant bit
of each byte first.

MicroUnity's architecture codes are specified by the following table:

| Internal code name | Code number          |
|--------------------|----------------------|
| Mnemosyne          | 0x00 40 a3 49 d2 e4  |
| Euterpe            | 0x00 40 a3 24 69 93  |
| Calliope           | 0x00 40 a3 92 b4 49  |
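One way to relate the table to the 24-bit company identifier quoted in the accompanying note (0000 0000 0000 0010 1100 0101) is sketched below in Python. It rests on an observation rather than a stated rule: reversing the bit order within each byte of that identifier reproduces the leading three bytes of every code number in the table, which is consistent with bits being driven least-significant bit first. The helper name is hypothetical.

```python
def reverse_bits(byte: int) -> int:
    """Reverse the order of the eight bits in a byte."""
    return int(f"{byte & 0xFF:08b}"[::-1], 2)


# Company identifier as printed in binary in the note on company identifiers.
company_id = 0b0000_0000_0000_0010_1100_0101          # 0x0002C5

# Reverse the bits of each byte, keeping the byte order unchanged.
prefix = bytes(reverse_bits(b) for b in company_id.to_bytes(3, "big"))

print(prefix.hex(" "))  # 00 40 a3
```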



Refer to the designated architecture specification for architecture revision codes. 



56. Company identifiers are a 24-bit value assigned by authority of the IEEE. Ask for a unique
"company identifier" for your organization:

Registration Authority for Company Identifiers
The Institute of Electrical and Electronics Engineers
445 Hoes Lane
Piscataway, NJ 08855-1331
USA
(908) 562-3812

MicroUnity's unique company identifier is: 0000 0000 0000 0010 1100 0101. Only MicroUnity
may assign unique 48-bit identifiers that begin with this value. Others may assign 48-bit
identifiers that begin with a 24-bit company identifier assigned by authority of the IEEE.

MicroUnity will, upon request, supply unique 48-bit identifiers for architectures, implementors,
or manufacturers of designs which are fully compliant with the Cerberus Serial Bus
Architecture. For assignment of identifiers, contact MicroUnity:

Craig Hansen, Chief Architect
Registration Authority for Unique Identifiers
MicroUnity Systems Engineering, Inc.
255 Caspian Drive
Sunnyvale, CA 94089-1015
Tel: (408) 734-8100
Fax: (408) 734-3136






The octlet at address 1 contains an implementation code and revision identifier.
The implementation code and revision identifies each distinctly designed
engineering version of a device. The implementation code is a unique 48-bit
identifier, as for architecture codes. Normally, a change in the upper byte of the
revision indicates a change in which features may have changed, or in which
mask layers of a device have been modified. A change in the lower byte of the
revision signifies a change made to repair design defects, or in which only some
mask layers of a device have been modified.

Refer to the designated architecture specification for the values of the 
implementation code and revision fields. 

The octlet at address 2 contains a manufacturer code and revision identifier. The
manufacturer code and revision identifies each distinct manufacturing database
of an implementation. The manufacturer code is a unique 48-bit identifier, as for
architecture codes. Changes in the manufacturer revision may result from
modifications made to any or all mask layers to enhance yield or improve expected
device performance.

Refer to the designated architecture specification for the values of the
manufacturer code and revision fields.

The octlet at address 3 optionally contains a unique device serial number or
random number, and optionally contains a configurable address register. If the
octlet does not contain a serial or random number, it must contain a 64-bit zero
value.

If the octlet contains a unique device serial number, it must be a unique 48-bit
value, as for architecture codes.

If the octlet contains a random number, it must be a value chosen from a uniform
distribution, selected whenever the device is reset.

The optional configurable address register permits a system design in which some
devices are set to identical Cerberus device addresses at system reset time and
dynamically have their addresses moved to unique addresses by some Cerberus
device. The configurable address register must be set to the address designated
by the SN3..0 pins whenever the device is reset. A device which implements the
configurable address capability must also implement either a unique device serial
number or a random number, must implement the arbitration mechanism during
responses from read-octlet requests, and must ensure that all devices which are
originally set to the same address at reset time respond to a read-octlet with
identical latency. An initiator device on Cerberus may set the configurable
address register by reading the entire octlet at address 3, reading both the
serial/random number and the configurable address register. By the use of the
bitwise arbitration mechanism, only one device completes the read-octlet response.
The initiator device then writes a value to octlet address 3, where the first
48 bits of the value written must match the value just read. All target devices then
examine the first 48 bits of the value written, and only if the value matches the
contents of the serial/random number on the device, use the last 16 bits to load
into the configurable address register. The initiator will repeat this process until
no devices remain at the original address, at which time a bus time-out occurs on
the read-octlet transaction.
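The address-assignment loop can be sketched as a small Python simulation. It is a simplified model under stated assumptions: start, parity and delay bits are ignored, every device holds a distinct 48-bit serial or random number, and the names `octlet3_bits`, `arbitrate` and `assign_addresses` are hypothetical.

```python
def octlet3_bits(serial48: int, address16: int) -> list[int]:
    """Bits of octlet 3 in transmission order: MSB byte first, LSB of each byte first."""
    value = (serial48 << 16) | address16
    bits = []
    for byte in value.to_bytes(8, "big"):
        bits += [(byte >> i) & 1 for i in range(8)]
    return bits


def arbitrate(serials, address16: int) -> int:
    """Serial number of the one device that completes the read-octlet response."""
    contenders = {s: octlet3_bits(s, address16) for s in serials}
    for position in range(64):
        bus = min(bits[position] for bits in contenders.values())  # wired-and
        # Devices that released a one but saw a zero drop out.
        contenders = {s: b for s, b in contenders.items() if b[position] == bus}
    (winner,) = contenders  # distinct serials leave exactly one contender
    return winner


def assign_addresses(serials, shared_address: int, first_free: int) -> dict:
    """Move each device from shared_address to its own address, one per round."""
    remaining, assigned, next_address = set(serials), {}, first_free
    while remaining:  # an empty set models the bus time-out that ends the loop
        winner = arbitrate(remaining, shared_address)   # read octlet 3
        assigned[winner] = next_address                 # write octlet 3 back
        remaining.discard(winner)
        next_address += 1
    return assigned
```

In this model a device that transmits a zero where another transmits a one stays in contention, so each round singles out exactly one device regardless of how many share the address.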

The octlets at addresses 4 and 5 contain architecture parameters. Values are
[...] and implementor-independent; refer to the
designated architecture specification for information. Addresses 4 and 5 are
read/only; an attempt to write to these addresses may cause either a normal
termination or an error response.

| octlet | bits 63..0 (64 bits) |
|--------|----------------------|
| 4      | architecture parameters not specified by Cerberus |
| 5      | architecture parameters not specified by Cerberus |

Octlet 6 designates device settings. Values in address 6 are changed only
by external devices. Three bits of the first byte have standard meaning for all
Cerberus devices. Bits 60..0 are not specified by Cerberus except by the
restriction that these values are changed only by external devices, not by the
device itself; refer to the designated architecture specification for information.

Writing a one to bit 63, r, of octlet 6 causes the device to perform a device circuit
reset, which is equivalent to the reset performed by driving the SD signal (to a
zero value) for 33 or more cycles, and sets the device to an initial state, in which
previous device state may be lost, previous control settings may be lost, and
[...] after which bits 63 and 62 of the status register below are set (to ones).

Writing a one to bit 62, c, of octlet 6 causes the device to perform a device logic
clear, which initializes the device to a known, quiescent, initial state, in which
previous device state may be lost, but does not affect control register settings
[...]

Writing a one to bit 61, s, of octlet 6 causes the device to perform a self-test, after
which previous device state may be lost, and after which bit 62 of the status
register below is set (to one) if the self-test yields satisfactory results. Bit 63 of the
status register below is set (to one) at the end of the self-test.

| octlet | bit 63 | bit 62 | bit 61 | bits 60..0 (61 bits) |
|--------|--------|--------|--------|----------------------|
| 6      | r      | c      | s      | other device settings not specified by Cerberus |



A 16-bit field provides for the possibility of configuring devices which respond to addresses
[...] change the dividing line between Cerberus net
addresses and device addresses. Gateway designers might want to consider this possibility.
Octlet 7 designates device status. Values in address 7 are normally modified only
by the device itself, except when an external device may clear status or error
conditions; this register is read/write. However, the only valid data which can be
written to this register is a zero value, which clears any outstanding status or error
reports. Two bits of the first byte have standard meaning for all Cerberus devices.
Bits 61..0 are not specified by Cerberus except by the restriction that these values
are modified only by the device itself except for clearing by an external device;
refer to the designated architecture specification for information.

Bit 63, c, of octlet 7 indicates whether the device has completed reset, clear, or
self-test.

Bit 62, s, of octlet 7 indicates whether the device has successfully completed reset,
clear, or self-test.

| octlet | bit 63 | bit 62 | bits 61..0 (62 bits) |
|--------|--------|--------|----------------------|
| 7      | c      | s      | other device status not specified by Cerberus |
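The standard bits of octlets 6 and 7 can be expressed as masks, as in this Python sketch. The constant and function names are hypothetical; the bit positions follow the layouts described for the two registers.

```python
from typing import Optional

# Octlet 6 (device settings): writing a one requests the operation.
SETTING_RESET = 1 << 63      # r: device circuit reset
SETTING_CLEAR = 1 << 62      # c: device logic clear
SETTING_SELF_TEST = 1 << 61  # s: self-test

# Octlet 7 (device status): set by the device itself.
STATUS_COMPLETE = 1 << 63    # c: reset, clear, or self-test has completed
STATUS_SUCCESS = 1 << 62     # s: it completed successfully


def operation_succeeded(status: int) -> Optional[bool]:
    """None while the operation is still pending, otherwise its outcome."""
    if not status & STATUS_COMPLETE:
        return None
    return bool(status & STATUS_SUCCESS)
```

An initiator might write `SETTING_SELF_TEST` to octlet 6, then read octlet 7 repeatedly until `operation_succeeded` returns something other than None, and finally write zero to octlet 7 to clear the report.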



Octlets at addresses 8..2^16-1 are not specified by Cerberus. Refer to the
designated architecture specification for information.

Gateways 

The Cerberus bus may be extended into a network of buses using a gateway. 
Gateways connect between buses that use the wired-and signalling protocol
described above. A gateway attaches to a local Cerberus bus and receives and 
retransmits bus requests and responses over a linkage to other gateways, thereby 
reaching to additional Cerberus buses. This document does not specify the 
protocol used to link gateways. 

The diagram below shows a gateway network connecting several Cerberus buses: 
Each Cerberus bus in a Cerberus network may, for specification purposes, be
assigned a unique network number, in the range 0..2^16-1. These network numbers
never appear directly in Cerberus device addresses, as the target network byte
contains only a relative net number: the target net either minus, or xor'ed with,
the initiator net. Thus the relative target network address is always zero when
the initiator and the target are on the same Cerberus bus, and is always non-zero
when they are on different buses.
A Cerberus bus permits only one transaction to occur at a time. However, a
Cerberus network may have multiple simultaneous transactions, so long as the
target and initiator network addresses are all disjoint. In more precise terms, the
network addresses must satisfy the relations:

target_i ≠ initiator_j
target_i ≠ target_j
initiator_i ≠ initiator_j, for all i ≠ j.

A gateway network may impose more restrictive conditions for simultaneous
transactions because of limits on the performance or bandwidth of the gateway
network. When these conditions are not satisfied, one or more transactions may
be selected to be aborted on the local Cerberus bus on which they are initiated,
by any fair-scheduling mechanism.
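The disjointness relations can be checked mechanically, as in this Python sketch. The function name and the representation of each transaction as an (initiator net, target net) pair are assumptions made for illustration.

```python
def may_proceed_simultaneously(transactions) -> bool:
    """Check the three relations for every pair of distinct transactions.

    transactions: sequence of (initiator_net, target_net) pairs.
    """
    for i, (init_i, targ_i) in enumerate(transactions):
        for j, (init_j, targ_j) in enumerate(transactions):
            if i == j:
                continue
            if targ_i == init_j or targ_i == targ_j or init_i == init_j:
                return False
    return True
```

A gateway implementation that detects a violation would then pick one of the conflicting transactions to abort by whatever fair-scheduling rule it uses.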

Each Cerberus bus is connected to the gateway network by exactly one
gateway. When a request packet of a transaction is received by a gateway on a
Cerberus bus, the first byte of the packet specifies a network number. If this byte
is non-zero, the gateway, which we will designate the initiator gateway, must carry
this transaction across the gateway network. This number is interpreted as a
signed byte, relative to the initiator gateway, and specifies a gateway to be the
target of the transaction, which we will designate the target gateway. We will refer
to the bus to which the initiator gateway is attached as the initiator
bus, and the bus to which the target gateway is attached as the target bus.

The request packet is carried via the initiator gateway, through the gateway
network to the target gateway, which then re-transmits the packet on the target
bus. When the request packet is re-transmitted on the target bus, the network
number byte is zero, designating a target on the target bus. The initiator gateway
may delay transmission of the request packet on the initiator bus as required to
limit or manage the flow of information through the gateway network, between
each byte of the request packet. The initiator gateway must also delay
transmission at the end of the last byte of the request packet in order to ensure
that packet aborts on the target bus are propagated back to the initiator bus. The
initiator gateway must also ensure that a target device which responds just barely
within the time-out limit on the target bus does not cause a time-out on the
initiator bus, generally by asserting a delay on the initiator bus until this condition
can be assured.

When a response packet is generated on the target bus (which may be from either
the addressed target or some time-out generator), the packet is carried in the
reverse direction by the gateway network. This response and any further packets
are carried until the end of the transaction. The response and any
further packets are not changed by the gateway network.

When a local Cerberus bus reset is received by a gateway, the reset is carried by
the gateway network and each other gateway then re-transmits a reset transaction
on all other local buses.

Repeater 

A Cerberus bus may be extended by inserting repeaters. A repeater electrically
separates two segments of a Cerberus bus, but provides a transparent connection
between these two segments. Using a repeater is advantageous when the
capacitive load or clock skew between Cerberus devices on a large bus would
require a reduction in the clock rate. The system designer must ensure that device
addresses remain unique across what is logically a single bus.



[Figure: Repeater (two bus segments, each with SC and SD signals, joined by a repeater with an SS signal)]







A simple repeater repeats each packet seen on one side of the
repeater on the other side, with a delay of at least one clock cycle. If two
transactions appear nearly simultaneously on each side of the repeater, the
repeater must abort one of the transactions and permit the other to be repeated.
This arbitration must be performed fairly, such as by alternating which side of the
repeater is preferred on consecutive collisions.

A simple repeater continues until the end of the transaction by repeating the
response packets, which may appear on the same or opposite side as the original
request packet of the transaction.

If the topology of the Cerberus is constructed so that only target devices exist on
one side of the repeater, the design may be simplified by the elimination of the
arbitration function. In such a case, transactions may only originate from the side
designated to contain initiator-capable devices.

A more sophisticated repeater may "learn" which addresses are on each side of
the repeater, and only repeat transactions which need to cross the repeater to be
completed. Alternatively, a repeater may be constructed with knowledge of the
addresses to be placed on each side, such as addresses 0..127 on one side and
addresses 128..255 on the other, again permitting the selective repeating of
packets across the repeater.

Synchronous Repeater

A very simple form of repeater may be employed to divide up the capacitive and
leakage load on the SD signal of a Cerberus bus into two or more segments, when
a common SC clock reference is used.



[Figure: Synchronous repeater (SD segments 1 through n, with common SC and SS signals)]
For large networks, this repeater improves performance by dividing up the RC
delay by a factor of n, though two bus settling periods now occur on each SC clock
period, so the speedup is approximately n/2.






This circuit can be economically implemented using a single TTL '621 part and a
pull-up resistor:



[Figure: Eight-port synchronous repeater (a '621 part with GBA and GAB inputs, SC, eight SD segments on the A and B ports, and a 500 Ω pull-up resistor)]
Icarus Interprocessor Protocol

MicroUnity's Icarus interprocessor protocol uses Hermes high-bandwidth
channels to connect Terpsichore processors together, either directly or through
external switching components, permitting the construction of shared-memory,
coherently- or incoherently-cached multiprocessors. Icarus uses Hermes in the
"Dual-Master Pair" configuration, and can be extended for use in
"Multiple-Master Ring" configurations.

Internal daemons within Terpsichore perform and respond to Hermes write
[...] interprocessor communication protocol is
[...] generation of memory references to
remote processors, for access to Terpsichore's local physical memory space, and
for the transport of remote references to other remote processors.

Interprocessor Topologies

The simplest multiprocessor configuration that can be built with the Icarus
protocols is a dual-processor:

[Figure: Dual-processor Terpsichore with Icarus interprocessor link]

The diagram below represents the same dual-processor system, in a simpler
notation:

[Figure: Dual-processor Terpsichore with Icarus interprocessor link]



In the configuration above, a pair of Hermes channels are connected together to
form an Icarus interprocessor link in the Dual-Master Pair configuration. A
Cerberus bus connects all the system components together to facilitate system
configuration. The Terpsichore processors all run off of a common frequency
clock, as required by the Hermes channels that connect between processors.









[Figure: Dual-processor Terpsichore with Icarus interprocessor link]



A Terpsichore processor's dual Icarus links, each in the Dual-Master Pair
configuration, may connect to two different processors. Using the
transponder daemons in each processor, several processors may be
interconnected into a linear network of arbitrary size:



Four-processor Terosichcra with Icarus interpror.fissnr links 



[...] forming a ring of [...]:

Four-processor Terpsichore with Icarus interprocessor links

In the configuration above, two Icarus links are connected to each Terpsichore
processor, forming a single ring.

[...] network of [...]

Sixteen-processor Terpsichore with Icarus interprocessor links

In the configuration above, two Icarus links are connected to each Terpsichore
processor, forming a single ring.

Other multidimensional topologies can be constructed by using multi-master rings
[...] Terpsichore processors has n Icarus link-pairs available for connection into
dual-master or multi-master configurations. For example, with a 4-master ring:

Four-processor Terpsichore ring building-block

These building blocks can then be assembled into radix-n switching networks.

By connecting Icarus links to external switching devices, multiprocessors with a
[...] arbitrary interconnect [...]

Eight-processor Terpsichore with Icarus links to switching fabric



Link-level and Transaction-level Protocol

Two-packet link-action nomenclature

We designate the term "link-action" to describe the low-level packet protocol used
between a Hermes master device and a Hermes slave device. The packets that
make up a link-action contain a three-bit link-action identifier, or "lid," which
permits up to eight outstanding link-actions to be in progress at any point in time.

Link-actions consist of two actions. Each packet transmitted on the Hermes ring
corresponds to an action:

Request     the action taken by a requester to start the transaction
Response    the action taken by the responder to finish the transaction

Link-action nomenclature

These actions and their relation to the data flow are shown below:

Requester  --- Request  --->  Responder
Requester  <-- Response ---   Responder

Link-action actions and data flow



Four-packet transaction nomenclature

We designate the term "transaction" to describe the upper-level packet
protocol used when embedding a four-packet, or "split" transaction above the
link-level Hermes packet protocol.

Transactions are used when the latency of a transaction may require that more
than eight actions are outstanding at a point in time, in order to maintain the
desired throughput of the protocol. Embedding the transaction protocol above the
link-action protocol limits the amount of link-level state which must be
implemented.

Certain of the packets that make up a transaction contain an eight-bit transaction
identifier, or tid, which permits up to 256 outstanding transactions to be in
progress at any point in time. These packets also contain link-action identifiers,
or lids, which connect these packets with others which are part of the transaction,
but do not contain a tid.
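As an illustration of the 256-entry limit described above, the following is a minimal sketch of how a requester might track outstanding transaction identifiers. The table type, the function names, and the lowest-free allocation policy are all assumptions made for this example; the text only states that tids are eight bits wide.

```c
#include <stdint.h>
#include <string.h>

#define TID_COUNT 256  /* an eight-bit tid allows 256 outstanding transactions */

/* Hypothetical bookkeeping structure: one flag per possible tid. */
typedef struct {
    uint8_t in_use[TID_COUNT];
} tid_table;

static void tid_table_init(tid_table *t)
{
    memset(t->in_use, 0, sizeof t->in_use);
}

/* Returns a free tid in 0..255, or -1 when all 256 are outstanding.
   The lowest-free policy is an arbitrary choice for this sketch. */
static int tid_alloc(tid_table *t)
{
    for (int i = 0; i < TID_COUNT; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Marks a tid as free again once its transaction has completed. */
static void tid_release(tid_table *t, int tid)
{
    if (tid >= 0 && tid < TID_COUNT)
        t->in_use[tid] = 0;
}
```

A requester using such a table would have to stall a new remote reference whenever tid_alloc returns -1, and release the tid when the matching response arrives.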

Transactions consist of four actions. Each action results in one or more link-level
Hermes packets transmitted on the channels:

Request        the action taken by a requester to start the transaction
Indication     the reception of a request by a responder
Response       the action taken by the responder to finish the transaction
Confirmation   the reception of the response by the requester

Transaction nomenclature

These actions and their relation to the data flow are shown below:

Requester  --- Request  --->  Responder   (Indication at the responder)
Requester  <-- Response ---   Responder   (Confirmation at the requester)

Transaction actions and data flow

Transaction-    Typical transaction          Link-level   Link-action
level action    message                      action       command

Request         read/write-sizelet request   Request      write-octlet
Indication      Remote-indication            Response     write-response
Response        read/write-sizelet response  Request      write-octlet
Confirmation    Remote-confirmation          Response     write-response

Transaction protocol for Icarus Requester Daemon
Icarus Action Format

Request and Response actions

A series of link-level write octlet operations comprise an Icarus request or
response action. The address of the write operation contains target routing,
transaction-id, command, and sequence information in the following format:

A remote request is a write octlet to an address of the form:

 31              16 15         8 7          0
+------------------+------------+------------+
|       node       |    tid     |    com     |
+------------------+------------+------------+
         16              8            8

with data of the form:

 63                                         0
+--------------------------------------------+
|                   octlet                   |
+--------------------------------------------+
                      64

The tid field contains an 8-bit transaction id code which must be returned along
with the remote response. The tid field value must be unique among all
[...] transactions originating from distinct nodes may be equal.

The com field contains an 8-bit command code which, in the first octlet, designates
the operation to be performed in a request action or the result returned in a
response action. If the command code is in the range 0..31, in successive octlets,
the value of the com field indicates the number of octlets to follow (0..9),
such that the last octlet of a message contains a com field with a 0 value.

The node field contains a 16-bit node address which is the target of the action.
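The address layout described above (node in bits 31..16, tid in bits 15..8, com in bits 7..0) can be sketched as a pair of pack and unpack helpers. The structure and function names are hypothetical and chosen only for this illustration; the bit positions are the ones given in the address diagram.

```c
#include <stdint.h>

/* Illustrative container for the three address fields. */
typedef struct {
    uint16_t node; /* target node address, bits 31..16 */
    uint8_t  tid;  /* transaction identifier, bits 15..8 */
    uint8_t  com;  /* command code, bits 7..0 */
} icarus_addr_fields;

/* Combines the fields into the 32-bit write address. */
static uint32_t icarus_pack_addr(icarus_addr_fields f)
{
    return ((uint32_t)f.node << 16) | ((uint32_t)f.tid << 8) | (uint32_t)f.com;
}

/* Splits a 32-bit write address back into its fields. */
static icarus_addr_fields icarus_unpack_addr(uint32_t addr)
{
    icarus_addr_fields f;
    f.node = (uint16_t)(addr >> 16);
    f.tid  = (uint8_t)(addr >> 8);
    f.com  = (uint8_t)addr;
    return f;
}
```

A responder would typically unpack the address of an incoming write, act on com, and reuse the same tid when forming the response address.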
When embedded into a link-level write octlet operation, the Terpsichore
requester daemon request appears on the Hermes channel in the form of a
sequence of bytes, each 8 bits wide (bits 7..0), transmitted in this order:

  wal 2 | lid     (as printed in the source)
  com
  tid
  node 7..0
  node 15..8
  octlet 63..56
  octlet 55..48
  octlet 47..40
  octlet 39..32
  octlet 31..24
  octlet 23..16
  octlet 15..8
  octlet 7..0
  check

A transaction which has a payload of one octlet must use a link-level write octlet
operation. A transaction which has a payload of greater than one octlet may
successively use link-level write octlet operations to transmit the payload.

Indication and Confirmation actions

Indication and Confirmation actions consist of a series of link-level write octlet
response packets, one for each octlet of the Request and Response actions.

Icarus Requester Daemon

When Terpsichore attempts a load or store to a physical address in which the
high-order 16 bits are non-zero, the memory at that address is assumed to be
present in the memory space of a remote Terpsichore processor. The Icarus
Requester Daemon is an autonomous unit which attempts to satisfy such remote
memory references by communicating with an external device, either another
Terpsichore processor or a switching device which eventually reaches another
Terpsichore processor.

These remote references are characterized by an eight-byte physical byte
address, of which two bytes are used for specifying a processor node, and the
remaining six bytes are used for specifying a local physical address on that
processor node.
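The split of an eight-byte physical address into a two-byte node number and a six-byte local address can be sketched as follows. The function names are invented for this example; the only facts taken from the text are that the high-order 16 bits select the node and that a non-zero value there marks the reference as remote.

```c
#include <stdbool.h>
#include <stdint.h>

/* High-order 16 bits: the processor node that owns the memory. */
static uint16_t remote_node(uint64_t physical_address)
{
    return (uint16_t)(physical_address >> 48);
}

/* Low-order 48 bits: the local physical address on that node. */
static uint64_t local_address(uint64_t physical_address)
{
    return physical_address & UINT64_C(0x0000FFFFFFFFFFFF);
}

/* A reference is remote when the node field is non-zero. */
static bool is_remote(uint64_t physical_address)
{
    return remote_node(physical_address) != 0;
}
```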
The Icarus Requester Daemon associates each remote memory reference with a
transaction identifier of eight bits, permitting up to 256 such remote references
[...] Terpsichore [...].

The Icarus Requester Daemon takes the role of the Transaction Requester, and
an external device takes the role of the Transaction Responder. The daemon
generates writes to a specified byte-channel and module address, which causes an
[...] of cache lines in a remote memory. [...] point in time.

Terpsichore contains two such requester daemons which act concurrently to two
different byte-channel and/or module addresses.

Icarus Responder Daemon

The Icarus Responder Daemon accepts writes from a specified byte-channel and
module address, which enable an external device to generate transaction requests
[...] cache lines in the Terpsichore's local memory or to
generate Terpsichore events. The daemon also generates link-level writes to the
[...] responses to these transaction requests back to the external device.

Terpsichore contains two such responder daemons which act concurrently to two
different byte-channel and/or module addresses.

An external device takes the role of the Transaction Requester, and the Icarus
Responder takes the role of the Transaction Responder.

Icarus Transponder Daemon

The Icarus Transponder Daemon accepts writes from a specified Hermes
channel and module address, which enable an external device to cause an Icarus
Requester Daemon to generate a request on another Hermes channel and module
address.

Terpsichore contains two such transponder daemons which act concurrently
(back-to-back) between two different byte-channel and/or module addresses.

58 The term "sequence number" is avoided here, because the transaction-tags are
not necessarily sequential in nature.

The number of link-level requests to be outstanding is still under study.
Icarus Request

The following table summarizes the commands used for Icarus requests and
responses (response commands shown in bold):

code      command                                              payload (octlets)

[The code and payload columns for codes below 80 are only partly legible in the
source. Legible code values in this part of the table: 1..9, 23, 26, 30..34, 38,
39, 40..51, 52, 53, 56..67. Legible payload values: 10, 10. The command names
follow in source order.]

          last octlet of multi-octlet command
          Reserved
          read incoherent weak cache-line
          write response
          read allocate strong octlet
          read noallocate strong octlet
          read allocate weak octlet
          read noallocate weak octlet
          read allocate strong hexlet
          read noallocate strong hexlet
          read allocate weak hexlet
          read noallocate weak hexlet
          read hexlet response
          read incoherent cache-line response
          read coherent cache-line response
          Reserved
          read coherent strong cache-line
          Reserved
          read coherent weak cache-line
          Reserved
          write coherent strong cache-line
          write incoherent strong cache-line
          write coherent weak cache-line
          write incoherent weak cache-line
          write allocate strong octlet
          write noallocate strong octlet
          write allocate weak octlet
          write noallocate weak octlet
          write allocate strong hexlet
          write noallocate strong hexlet
          write allocate weak hexlet
          write noallocate weak hexlet
          add-and-swap allocate strong octlet little-endian
          add-and-swap noallocate strong octlet little-endian
          add-and-swap allocate weak octlet little-endian
          add-and-swap noallocate weak octlet little-endian
          Reserved
80        add-and-swap allocate strong octlet big-endian       2
81        add-and-swap noallocate strong octlet big-endian     2
82        add-and-swap allocate weak octlet big-endian         2
83        add-and-swap noallocate weak octlet big-endian       2
84        compare-and-swap allocate strong octlet              3
85        compare-and-swap noallocate strong octlet            3
86        compare-and-swap allocate weak octlet                3
87        compare-and-swap noallocate weak octlet              3
88        multiplex-and-swap allocate strong octlet            3
89        multiplex-and-swap noallocate strong octlet          3
90        multiplex-and-swap allocate weak octlet              3
91        multiplex-and-swap noallocate weak octlet            3
92        multiplex allocate strong octlet                     3
93        multiplex noallocate strong octlet                   3
94        multiplex allocate weak octlet                       3
95        multiplex noallocate weak octlet                     3
96..255   reserved

Icarus Request commands

Each field in the request formats below is one octlet (64 bits, bits 63..0).

A remote {add,swap,or,and} octlet request is data of the form:

  address | bytes 0..7

A remote read incoherent {strong,weak} cache-line request is data of the form:

  address

A remote read coherent {strong,weak} cache-line request is data of the form:

  address | coherence tag

A remote write incoherent cache-line request is data of the form:

  address | bytes 0..7 | bytes 8..15 | bytes 16..23 | bytes 24..31 | bytes 32..39 | [...]

A remote write coherent cache-line request is data of the form:

  address | [...] | bytes 40..47 | bytes 48..55 | bytes 56..63

A remote read {allocate,noallocate} {strong,weak} octlet request is data of the
form:

  address

A remote write {allocate,noallocate} {strong,weak} octlet request is data of the
form:

  [diagram illegible in source]

A remote read {allocate,noallocate} {strong,weak} hexlet request is data of the
form:

  address

A remote write {allocate,noallocate} {strong,weak} hexlet request is data of the
form:

  address | bytes 0..7 | bytes 8..15

Icarus Indication

An Icarus Indication consists of a link-level write-response packet for each
link-level write issued as an Icarus Request. Each link-level write-response
packet [...] link-level write-request packet. This serves both the link-level
purpose of issuing a response and indicating the ability to receive additional
link-level requests, and a transaction-level indication of receipt of the request
and the ability to receive additional transaction-level requests.

Icarus Response

Icarus Responses consist of a series of one or more link-level write-octlet
operations. The low-order bits of the addresses of the write operations contain
commands and tid information, and the data is the contents read from memory.

The octlet stream contains transaction-level responses from the Terpsichore
Responder daemon, which are summarized in the table below:

com       command                                  payload (octlets)

0         termination
1..9      continuation
10..22    Reserved
23        write response                           1
24..31    Reserved
32        read/add/swap octlet response            2
33        read hexlet response                     3
34        read incoherent cache-line response      9
35        read coherent cache-line response        10
36..255   reserved
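The response table above maps directly onto a small lookup routine. The sketch below returns the payload length for a response command code; the function name and the use of -1 for codes with no listed payload are choices made for this example, not part of the protocol description.

```c
#include <stdint.h>

/* Returns the payload length in octlets for a response command code,
   or -1 for termination, continuation, and reserved codes, which have
   no payload entry in the table. */
static int icarus_response_payload(uint8_t com)
{
    switch (com) {
    case 23: return 1;   /* write response */
    case 32: return 2;   /* read/add/swap octlet response */
    case 33: return 3;   /* read hexlet response */
    case 34: return 9;   /* read incoherent cache-line response */
    case 35: return 10;  /* read coherent cache-line response */
    default: return -1;
    }
}
```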
The com field contains an 8-bit response command, as given in the table
previously.

The tid field contains the 8-bit transaction id code used in the request message.

The node field contains the 16-bit processor number used in the request message.

A remote {read,add,swap} octlet response is data of the form:

  [diagram illegible in source]

A remote read hexlet response is data of the form:

  [diagram illegible in source]

A remote read incoherent cache-line response is data of the form:

  [diagram illegible in source]

A remote write response is data of the form:

  [diagram illegible in source]

A remote read coherent cache-line response is data of the form (each field is
one octlet, 64 bits):

  coherence tag | bytes 0..7 | bytes 8..15 | bytes 16..23 | bytes 24..31 |
  bytes 32..39 | bytes 40..47 | bytes 48..55 | bytes 56..63

A remote write coherent cache-line response is data of the form:

  coherence tag

Icarus Confirmation

An Icarus Confirmation consists of a link-level write-response packet for each
link-level write issued as an Icarus Response. Each link-level write-response
packet [...] link-level write-request packet. This serves both the link-level
purpose of issuing a response and indicating the ability to receive additional
link-level requests, and a transaction-level confirmation of receipt of the
response and the ability to receive additional transaction-level requests.

Deadlock 



The Icarus Requester, Responder, and Transponder daemons must act
cooperatively to avoid deadlock that may arise due to an imbalance of requests in
the system which prevents responses from being routed to their destination.

The requirements vary depending upon the characteristics of the system
configuration, and the mechanisms for deadlock avoidance are still under study.

Principal mechanisms to employ are cycle-free routing of requests, and the means
to prioritize responses above requests in forwarding priority.

Error handling 

The link-level packets contain a check byte which is designed to detect single-bit
transmission errors in the Hermes channel.

When either party in an Icarus transaction receives a packet with a [...]
register. Terpsichore then will [...] operation in a Cerberus register. [...]
Each party then re-issues any outstanding link-level transactions.

The contents of the address field in the [...] to ensure that the error handling
mechanism does not [...] operations. This is important because [...] the
transaction-level protocol contains non-idempotent operations.
Fixed-point Applications

Find Most-significant One

The following example performs a "find most-significant one" operation on
general register ra, placing the result in general register rb.

    E.ALMS    rb,ra

Find Least-significant One

The following example performs a "find least-significant one" operation on general
register ra, placing the result in general register rb.

    E.ADDI    rb,ra,-1
    E.ANDN    rb,rb,ra
    E.ALMS    rb,rb
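The three-instruction sequence above relies on a standard bit identity: subtracting one from a value and combining the result with the original through an and-not isolates the lowest set bit, after which a most-significant-one search finds its position. The portable sketch below illustrates that identity. It assumes that the and-not step computes ra AND NOT (ra - 1) and that a zero input yields -1; neither the operand order of E.ANDN nor the zero-input result of E.ALMS is stated in the text, and the function names are invented for this example.

```c
#include <stdint.h>

/* Index of the most significant set bit, or -1 when x is zero
   (the zero-input convention is an assumption of this sketch). */
static int find_ms_one(uint64_t x)
{
    int index = -1;
    while (x != 0) {
        x >>= 1;
        index++;
    }
    return index;
}

/* Index of the least significant set bit, or -1 when x is zero. */
static int find_ls_one(uint64_t x)
{
    uint64_t lowest = x & ~(x - 1);  /* keeps only the lowest set bit */
    return find_ms_one(lowest);
}
```

For example, 0x28 (binary 101000) has its most significant one at bit 5 and its least significant one at bit 3.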

Floating-point Applications

The following example demonstrates the inner loop of the Linpack benchmark.
This section is under construction.

Digital Signal Processing Applications

This section is under construction.

Image Processing Applications

The following examples demonstrate several applications, listed below in summary
form with the performance estimated. The estimates assume single-cycle loads
and stores, that is, they do not account for losses due to cache misses. However,
the memory reference patterns are very uniform, and with prefetching, they could
be kept invisible.
                                                      pixels per cycle

3x3 Filtering of Monochrome Image                     0.8
3x3 Filtering of Color Image                          0.2
Conversion of Monochrome Image to Color
Conversion of Color Image to Monochrome
5-point Horizontal Decimation of Monochrome Image
5-point Vertical Decimation of Monochrome Image
3x3 Decimation of Monochrome Image
5-point Horizontal Decimation of Color Image
5-point Vertical Decimation of Color Image
3x3 Decimation of Color Image

[Remaining legible values, whose rows cannot be determined from the source:
0.9, 0.5, 0.3]



Filtering of Monochrome Image

    k00 k01 k02
    k10 k11 k12
    k20 k21 k22

void MonochromeFilter(int8 *src, int8 *dst, int row, int pcount,
        int8 k00, int8 k01, int8 k02,
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[i] = (src[i-row-1]*k00 + src[i-row]*k01 + src[i-row+1]*k02 +
                  src[i-1]*k10 + src[i]*k11 + src[i+1]*k12 +
                  src[i+row-1]*k20 + src[i+row]*k21 + src[i+row+1]*k22)>>8;
    }
}

We now examine the assembler coding of the inner loop. Because there are eight
pixels in an octlet, the input size of the multiplier, this loop filters eight pixels at
once. The coefficients are placed in 9 registers symbolically named k00..k22, with a
copy of the coefficient in each byte of the register. We assume that the coefficients
[...] integer accumulation [...]; specifically, the sum of the values of the
coefficients <= 256. The row size [...]. Registers 8, 9, and 10 contain array
pointers.


1:  A.SUB          r2,r8,row
    L.64           r3,-1(r2)
    G.MUL.8        r4,r3,k00
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k01,r4
    L.64           r3,1(r2)
    G.MULADD.8     r4,r3,k02,r4
    L.64           r3,-1(r8)
    G.MULADD.8     r4,r3,k10,r4
    G.MULADD.8     r4,r3,k11,r4
    G.MULADD.8     r4,r3,k12,r4
    A.ADD          r2,r8,row
    L.64           r3,-1(r2)
    G.MULADD.8     r4,r3,k20,r4
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k21,r4
    L.64           r3,1(r2)
    G.MULADD.8     r4,r3,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,8
    A.ADD          r9,8
    B.NE           r8,r10,1b

[...] address computation instructions, this can [...] in 10 cycles [...]
single-cycle latency for G.MULADD. Loop unrolling can be used to handle greater
latency. The inner loop is 10 cycles per eight pixels.
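For reference, the 3x3 filter can also be written as a portable scalar routine that is easy to check against the packed assembler version. The sketch below is not the listing from this section: it uses unsigned 8-bit pixels and coefficients, passes the kernel as a 3x3 array, and accumulates in a wider integer, all of which are choices made for this example. As in the original, src must point at an interior pixel so that the row above, the row below, and one pixel to either side are valid.

```c
#include <stdint.h>

/* Scalar 3x3 filter over pcount consecutive pixels starting at src.
   row is the distance in bytes between vertically adjacent pixels.
   The coefficients are expected to sum to at most 256, so the
   weighted sum shifted right by 8 fits in one byte. */
static void monochrome_filter(const uint8_t *src, uint8_t *dst,
                              int row, int pcount, const uint8_t k[3][3])
{
    for (int i = 0; i != pcount; i++) {
        unsigned sum = 0;
        for (int r = -1; r <= 1; r++) {
            for (int c = -1; c <= 1; c++) {
                sum += (unsigned)src[i + r * row + c] * k[r + 1][c + 1];
            }
        }
        dst[i] = (uint8_t)(sum >> 8);  /* truncate, as in the text */
    }
}
```

With a kernel whose coefficients sum to exactly 256, a uniform image passes through unchanged, which makes a convenient sanity check.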

Filtering of Color Image

[...] the image is made up of pixels each 32 bits in size, 8 bits for each of
red, green, blue, and alpha. We treat each component [...] in the code:

void ColorFilter(int8 *src, int8 *dst, int row, int pcount,
        int8 k00, int8 k01, int8 k02,
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i;
    for (i=0; i!=4*pcount; i++) {
        dst[i] = (src[i-row-4]*k00 + src[i-row]*k01 + src[i-row+4]*k02 +
                  src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 +
                  src[i+row-4]*k20 + src[i+row]*k21 + src[i+row+4]*k22)>>8;
    }
}

The assembler coding of the inner loop is:

1:  A.SUB          r2,r8,row
    L.64           r3,-4(r2)
    G.MUL.8        r4,r3,k00
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k01,r4
    L.64           r3,4(r2)
    G.MULADD.8     r4,r3,k02,r4
    L.64           r3,-4(r8)
    G.MULADD.8     r4,r3,k10,r4
    G.MULADD.8     r4,r3,k11,r4
    L.64           r3,4(r8)
    G.MULADD.8     r4,r3,k12,r4
    A.ADD          r2,r8,row
    L.64           r3,-4(r2)
    G.MULADD.8     r4,r3,k20,r4
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k21,r4
    L.64           r3,4(r2)
    G.MULADD.8     r4,r3,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,8
    A.ADD          r9,8
    B.NE           r8,r10,1b



[...] algorithm as for the color image above. Operations are [...]

Conversion of Monochrome to Color

To convert a monochrome image to a color image, we must triplicate each
[...]

void MonochromeToColor(int8 *src, int8 *dst, int pcount) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[4*i] = src[i];
        dst[4*i+1] = src[i];
        dst[4*i+2] = src[i];
        dst[4*i+3] = 255;
    }
}

[...] the following inner loop (addressing operations and loop
overhead omitted - they do not influence the operation count):

1:  L.64.B         r4,0(r8)
    G.SHUFFLE.16   r2,r4,r4
    G.SHUFFLE.16   r3,r4,r5       # r5 contains -1
    G.SHUFFLE.8    r6,r2,r8
    G.SHUFFLE.8    r8,r3,r9
    S.128.B        r6,0(r9)
    S.128.B        r8,16(r9)
    A.ADD          r8,8
    A.ADD          r9,32
    B.NE           r8,r10,1b

The above sequence is 4 cycles per 8 pixels, or 2.0 pixels/cycle.

void MonochromeWithAlphaToColor(int8 *src, int8 *alpha, int8 *dst, int pcount) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[4*i] = src[i];
        dst[4*i+1] = src[i];
        dst[4*i+2] = src[i];
        dst[4*i+3] = alpha[i];
    }
}

Which results in the following inner loop:

1:  L.64.B         r4,0(r8)
    L.64.B         r5,0(r9)
    G.SHUFFLE.16   r2,r4,r4
    G.SHUFFLE.16   r3,r4,r5
    G.SHUFFLE.8    r6,r2,r8
    G.SHUFFLE.8    r8,r3,r9
    S.128.B        r6,0(r10)
    S.128.B        r8,16(r10)
    A.ADD          r8,8
    A.ADD          r9,8
    A.ADD          r10,32
    B.NE           r8,r11,1b

The above sequence is 4 cycles per 8 pixels, or 2.0 pixels/cycle.
Conversion of Color to Monochrome

To convert a color image to a monochrome image, a weighted sum of the red,
[...] <= 256, so overflow does not occur. The resulting [...]

void ColorToMonochrome(int8 *src, int8 *dst, int pcount, int8 k0, int8 k1, int8 k2) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[i] = (src[4*i]*k0 + src[4*i+1]*k1 + src[4*i+2]*k2)>>8;
    }
}

Which results in the following inner loop:

1:  L.128.B        r2,0(r8)
    G.DEAL.16      r2,r2,r3       # k0k1...k0k1k200...k200
    L.128.B        r4,16(r8)
    G.DEAL.16      r6,r4,r5       # k0k1...k0k1k200...k200
    G.DEAL.8       r2,r2,r6       # k0k0...k0k0k1k1...k1k1
    G.DEAL.8       r4,r3,r7       # k2k2...k2k20000...0000
    G.MUL.8        r6,r2,k0
    G.MULADD.8     r6,r3,k1,r6
    G.MULADD.8     r6,r4,k2,r6
    G.COMPRESS.16  r6,r6,8        # toss away low precision
    S.64           r6,0(r9)
    A.ADD          r8,32
    A.ADD          r9,8
    B.NE           r8,r10,1b

The above sequence is 8 cycles and writes 8 pixels, or 1.0 pixels/cycle.
The code below performs the same action, but also saves the alpha value into a
second destination array.

void ColorToMonochrome(int8 *src, int8 *dst, int8 *alpha, int pcount,
        int8 k0, int8 k1, int8 k2) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[i] = (src[4*i]*k0 + src[4*i+1]*k1 + src[4*i+2]*k2)>>8;
        alpha[i] = src[4*i+3];
    }
}

Which results in the following inner loop:

1:  L.128          r2,0(r8)
    G.DEAL.16      r2,r2,r3       # k0k1...k0k1k200...k200
    L.128          r4,16(r8)
    G.DEAL.16      r6,r4,r5       # k0k1...k0k1k200...k200
    G.DEAL.8       r2,r2,r6       # k0k0...k0k0k1k1...k1k1
    G.DEAL.8       r4,r3,r7       # k2k2...k2k20000...0000
    S.64           r5,0(r10)
    G.MUL.8        r6,r2,k0
    G.MULADD.8     r6,r3,k1,r6
    G.MULADD.8     r6,r4,k2,r6
    G.COMPRESS.16  r6,r6,8        # toss away low precision
    S.64           r6,0(r9)
    A.ADD          r8,32
    A.ADD          r9,8
    A.ADD          r10,8
    B.NE           r8,r10,1b

The above sequence is 8 cycles and writes 8 pixels, or 1.0 pixels/cycle.
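A scalar reference version of the color-to-monochrome conversion is useful for checking the packed inner loops. The sketch below covers both variants: it writes the weighted sum of the first three components of each 4-byte pixel and, when an alpha array is supplied, copies the fourth component as well. The unsigned types, the optional alpha pointer, and the function name are choices made for this example rather than the interface shown in the listings.

```c
#include <stdint.h>

/* Converts pcount 4-byte pixels to 8-bit monochrome. The weights k0,
   k1, k2 are expected to sum to at most 256 so that the result fits
   in a byte. alpha may be a null pointer when the fourth component
   is not needed. */
static void color_to_monochrome(const uint8_t *src, uint8_t *dst,
                                uint8_t *alpha, int pcount,
                                uint8_t k0, uint8_t k1, uint8_t k2)
{
    for (int i = 0; i != pcount; i++) {
        unsigned sum = (unsigned)src[4 * i] * k0
                     + (unsigned)src[4 * i + 1] * k1
                     + (unsigned)src[4 * i + 2] * k2;
        dst[i] = (uint8_t)(sum >> 8);  /* toss away low precision */
        if (alpha) {
            alpha[i] = src[4 * i + 3];
        }
    }
}
```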
Image warping is the general process of selectively stretching and shrinking an
image to make it appear to fit into a new shape, [...] stretched around a
sphere, or drawn on a surface that is tilted [...] to the viewing surface. A
principal data structure used to [...] such [...] copies of the image, as shown
in the diagram below. This [...] particular value because interpolation of the
elements of this [...] antialiased spatially-warped image. Note that the total
size of this structure is always exactly four times larger than the [...] the
image decimated in [...] the x or y direction [...] smaller and smaller going
right and down in the [...]. The original image need not be square or have sizes
that are powers of two for this structure.

In the sections below we explore two parts of the problem, the creation of the
array containing this decimated image, and the antialiased selection of items in the
table. These are the parts of the process which must be performed in real time for
real-time application of this process; the creation of [...] be precomputed, and
are a function of the rendering system used.

Decimation of Monochrome Image

[...] can be divided into two [...] direction only. The former generates all
the [...] image and the latter generates the remaining [...]. This divides the
problem into two parts, each [...] one-dimensional filtering, which is a great
advantage because the amount of [...] linearly with the size of the filter
function rather than quadratically, when using two-dimensional filtering.

[...] use a 5-point filter, [...] weights, k0..k4, are selected so that
k0+k1+k2+k3+k4 <= 256, so overflow [...]. The weighted sum is truncated, rather
than rounded, again, to avoid the possibility of overflow.

void HorizontalDecimationMonochrome(int8 *src, int8 *dst, int srow, int drow,
        int pcount, int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;
    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++, i+=2) {
            dst[k++] = (src[i-2]*k0 + src[i-1]*k1 + src[i]*k2 +
                        src[i+1]*k3 + src[i+2]*k4)>>8;
        }
        i+=srow-drow-drow;
    }
}

Which results in the following inner loop:

1:  L.128          r6,-2(r8)
    G.MUL.8        r4,r6,k0
    G.MULADD.8     r4,r7,k1,r4
    G.MULADD.8     r4,r6,k2,r4
    G.MULADD.8     r4,r7,k3,r4
    L.128          r6,2(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k4,r4
    G.COMPRESS.16  r4,r4,8
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b

([...] 6 cycles per 8 pixels, or 1.3 pixels/cycle.)
When decimating in the vertical direction, the rate is even higher still:

void VerticalDecimationMonochrome(int8 *src, int8 *dst, int srow, int drow,
        int pcount, int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;
    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++, i++) {
            dst[k++] = (src[i-2*srow]*k0 + src[i-srow]*k1 + src[i]*k2 +
                        src[i+srow]*k3 + src[i+2*srow]*k4)>>8;
        }
        i+=srow+srow-drow;
    }
}

Which results in the following inner loop:

1:  A.SUB          r2,r8,rowt2
    L.64           r3,0(r2)
    G.MUL.8        r4,r3,k0
    A.SUB          r2,r8,row
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k1,r4
    L.64           r3,0(r8)
    G.MULADD.8     r4,r3,k2,r4
    A.ADD          r2,r8,row
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k3,r4
    A.ADD          r2,r8,rowt2
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k4,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0-8(r9)
    A.ADD          r8,8
    A.ADD          r9,8
    B.NE           r8,r10,1b



[...] the rate is [...] pixels/cycle.

To generate the decimated array shown above, for an n^2 image, n^2 pixels
[...]: n^2/0.9 + 2n^2/1.3 = [...]. [...] image can be decimated in 2.8
Mcycles.

[...] horizontal [...] direction [...]. For this example we assume [...]
        [...]
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            [...] src[i+srow]*k21 + src[i+srow+1]*k22)>>8;
            i+=2;
        }
        i+=2*(srow-drow);
    }
}

Assembler code for inner loop:

1:  A.SUB          r2,r8,srow
    L.128          r6,-1(r2)
    G.DEAL.8       r6,r6,r7
    G.MUL.8        r4,r6,k00
    G.MULADD.8     r4,r7,k01,r4
    L.128          r6,1(r2)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k02,r4
    L.128          r6,-1(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k10,r4
    G.MULADD.8     r4,r7,k11,r4
    L.128          r6,1(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k12,r4
    A.ADD          r2,r8,srow
    L.128          r6,-1(r2)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k20,r4
    G.MULADD.8     r4,r7,k21,r4
    L.128          r6,1(r2)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b

[...] because the [...] multiplier required the additional DEAL operations to be
[...]
Decimation of Color Image

[...] selected so that k0+k1+k2+k3+k4 <= 256 [...]. The weighted sum is
truncated, rather than rounded, again, to avoid the possibility of overflow.

        [...] int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8; i++;
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8; i++;
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8; i++;
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8; i++;
            i+=4;
        }
        i+=4*(srow+srow-drow-drow);
    }
}

Which results in the following inner loop:

1:  L.128          r6,-8(r8)
    G.DEAL.32      r6,r6,r7
    G.MUL.8        r4,r6,k0
    G.MULADD.8     r4,r7,k1,r4
    L.128          r6,0(r8)
    G.DEAL.32      r6,r6,r7
    G.MULADD.8     r4,r6,k2,r4
    G.MULADD.8     r4,r7,k3,r4
    L.128          r6,8(r8)
    G.DEAL.32      r6,r6,r7
    G.MULADD.8     r4,r6,k4,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b

([...] when the filter kernel [...] pixels/cycle.)

When decimating in the vertical direction, the rate is even higher still:

void VerticalDecimationColor(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;
    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=4*drow; j++, i++) {
            dst[k++] = (src[i-8*srow]*k0 + src[i-4*srow]*k1 + src[i]*k2 +
                        src[i+4*srow]*k3 + src[i+8*srow]*k4)>>8;
        }
        i+=4*(srow+srow-drow);
    }
}

Which results in the following inner loop:

1:  A.SUB          r2,r8,rowt8
    L.64           r3,0(r2)
    G.MUL.8        r4,r3,k0
    A.SUB          r2,r8,rowt4
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k1,r4
    L.64           r3,0(r8)
    G.MULADD.8     r4,r3,k2,r4
    A.ADD          r2,r8,rowt4
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k3,r4
    A.ADD          r2,r8,rowt8
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k4,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0-8(r9)
    B.NE           r8,r10,1b

To generate the decimated array shown above, for an n^2 image, n^2 pixels are
[...] pixels generated in the vertical direction. Using 5 pixel filter functions,
this takes: n^2/0.7 + 2n^2/0.5 = [...]. [...] can be decimated in [...] Mcycles.

The last e.xample in this section decimates a color signal in both directions 

CdTall J' ^.""^V/^^. in each direction, and a x3 fSe 

kernel. Real applications of decimauon may use larger filter kernels, but this size 

5x.:°a1Inrd?oV<tx^ ^ 

void DecimateCofor{int8 -src. int 8 -dst, .ni srow. int drow. ,nt pcount 
int8 kOO. rntS kOl. int8 kG2. 
im8 klO. intS k1l. int8 kl2. 
intS k20. iniB k2l. inl8 k22) I 
int i.j.k: 

for (k=0.i=0: k!=4'pcount: ) { 
for (j=0: jlsdrow: { 

"^^^^^VhTa^^^^^^ " src[i.4-srowj-k01 + src(i.4-srow*4rk02 * 

src[i-4J-klO + Sfcfi)-kn + src(i+4]-kl2 + I ^^^^ * 
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a.v-ll-4j -r 3.^.:, ♦ SfC|l+4j-k12 

src[i..4-3rc-.v-4|-k2C. - 3.-cii,i-5rowJ-K2l * src{.-^-5r=v/,^!-k22)»8: 

"""^iSiS^^'"'!^^^ * s.c(,.4-srcw]-kD, . src(i-4-sro....)-K02 . 
src I-4J Klo s.,i:j .<i , X src[i+4]"k12 + 

src[.^4-3rGw-a;-k2C - 5rc{i^^-5row]-k2l ^ src[..4-srcw.dl-k221»3: 

src[.-4]-kl0 - srciij-kn src[i+4]-kl2 + i ^ .... j kl^ + 

^^^^src(i.4-3row-4]-k20 . 5rc[..4-srow]-k21 . srcf..4-5rcw^4rk22)»3: 

I 

»i-=4'(srow-i-srow-drcw-drcw;. 

I 

Assembler code for inner loop:

1:  A.SUB           r2,r8,srow
    L.128           r6,-4(r2)
    G.DEAL.32       r6,r6
    G.MUL.8         r4,r6,k00
    G.MULADD.8      r4,r7,k01,r4
    L.128           r6,4(r2)
    G.DEAL.32       r6,r6
    G.MULADD.8      r4,r6,k02,r4
    L.128           r6,-4(r8)
    G.DEAL.32       r6,r6
    G.MULADD.8      r4,r6,k10,r4
    G.MULADD.8      r4,r7,k11,r4
    L.128           r6,4(r8)
    G.DEAL.32       r6,r6
    G.MULADD.8      r4,r6,k12,r4
    A.ADD           r2,r8,srow
    L.128           r6,-4(r2)
    G.DEAL.32       r6,r6
    G.MULADD.8      r4,r6,k20,r4
    G.MULADD.8      r4,r7,k21,r4
    L.128           r6,4(r2)
    G.DEAL.32       r6,r6
    G.MULADD.8      r4,r6,k22,r4
    G.COMPRESS.16   r4,r4,8
    S.64            r4,0(r9)
    A.ADD           r8,16
    A.ADD           r9,8
    B.NE            r8,r10,1b

After some reordering of the address calculation instructions, the inner loop is 16 
cycles per 2 pixels, or 0.12 pixels/cycle. 

Fractional Interpolation
This section is under construction. 
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The tollowina examples demonstrate kev portions of TPEn .,nA \WT=r 

and siores. thai is thev do no, f ■ , '"""is sinjle-cvck loads 

be kep, invisible """^ "'i* Pf^fnching. thev could 




perfonn a DCT „„ the ,oS lot colulri- 1, ""P'™"" operations is to 
perfo,» a second. ide„S'li^V!t7^S Ze";^,.^ *™ 

r; <;KTTrPT p • calculation IS performed entire v in registers u^ino 



Assume the matrix originally is in the order:

     0  1  2  3  4  5  6  7
     8  9 10 11 12 13 14 15
    16 17 18 19 20 21 22 23
    24 25 26 27 28 29 30 31
    32 33 34 35 36 37 38 39
    40 41 42 43 44 45 46 47
    48 49 50 51 52 53 54 55
    56 57 58 59 60 61 62 63



Stone, Harold S., "Parallel Processing with the Perfect Shuffle", IEEE Transactions on
Computers, Vol. C-20, No. 2, February 1971, pp. 153-161.
After one shuffling, the matrix is in the order:



     0 32  1 33  2 34  3 35
     4 36  5 37  6 38  7 39
     8 40  9 41 10 42 11 43
    12 44 13 45 14 46 15 47
    16 48 17 49 18 50 19 51
    20 52 21 53 22 54 23 55
    24 56 25 57 26 58 27 59
    28 60 29 61 30 62 31 63



After a second shuffling, the matrix is in the order:

     0 16 32 48  1 17 33 49
     2 18 34 50  3 19 35 51
     4 20 36 52  5 21 37 53
     6 22 38 54  7 23 39 55
     8 24 40 56  9 25 41 57
    10 26 42 58 11 27 43 59
    12 28 44 60 13 29 45 61
    14 30 46 62 15 31 47 63



After a third shuffling, the matrix is in the order:

     0  8 16 24 32 40 48 56
     1  9 17 25 33 41 49 57
     2 10 18 26 34 42 50 58
     3 11 19 27 35 43 51 59
     4 12 20 28 36 44 52 60
     5 13 21 29 37 45 53 61
     6 14 22 30 38 46 54 62
     7 15 23 31 39 47 55 63

C code for procedure: 

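void Matrix8By8Transpose(int16 *src, int16 *dst) {
    int16 tm0[64];
    int16 tm1[64];
    int i;

    /* three perfect shuffles transpose the 8-by-8 matrix */
    for (i=0; i<32; i++) { tm0[2*i] = src[i]; tm0[2*i+1] = src[i+32]; }
    for (i=0; i<32; i++) { tm1[2*i] = tm0[i]; tm1[2*i+1] = tm0[i+32]; }
    for (i=0; i<32; i++) { dst[2*i] = tm1[i]; dst[2*i+1] = tm1[i+32]; }
}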

Assembler code for procedure: 

_Matrix8By8Transpose:
    L.128.I         r4,r2,0         # 00 01 02 03 04 05 06 07
    L.128.I         r12,r2,64       # 32 33 34 35 36 37 38 39
    G.SHUFFLE.8     r20,r4,r12      # 00 32 01 33 02 34 03 35
    L.128.I         r6,r2,16        # 08 09 10 11 12 13 14 15
    G.SHUFFLE.8     r22,r5,r13      # 04 36 05 37 06 38 07 39
    L.128.I         r14,r2,80       # 40 41 42 43 44 45 46 47
    G.SHUFFLE.8     r24,r6,r14      # 08 40 09 41 10 42 11 43
    L.128.I         r8,r2,32        # 16 17 18 19 20 21 22 23
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1-123.1 r~6'2 '^^ - ; -= i: '5 4? 

G "^HUFPLF a o'.'.'- "9 d1 52 53 54 55 

L123 1 * ^6 48 17 49 ,3 50 19 51 

L.128.1 r-='5-r5 I f i =^ 2^ 54 23 =0 

G SHUF=i F a IT- '.^ =3 =? 60 61 62 63 

GSHUFFLE.8 r4 r^C nn ik -jo ,o -^ , 

nCMiiccica ~ ; ~" -UU 16 32 43 01 17 33 49 

i lHUFFtil : 02 ^3 34 50 03 19 II 5? 

G.SHUFfCf 8 rr^U% Z °' |° 52 05 21 37 53 

G.SHUFFLES l^T^ i^ 06 22 33 54 07 23 39 55 

G.3HUFF[i:3 nl:|c :^3^ ! ?8 24 40 56 09 25 41 57 

G SHUFFLE a , - loc ':. ''0 26 42 38 11 27 43 59 

GiHUFFtl:! !S :p ;:;: ! ;2 23 44 eo 13 29 45 si 

r_..^,.r^„ "'4 30 4 0 62 15 31 47 63 

^s'S7''-' ;l^-:3H^2 ^ 00 08 16 24 32 40 48 56 

^'Sr^' ;|M;e^ ^^01 09 17 25 33 4 1 49 57 

^'S'''-' a;l32^ =^ 02 10 18 26 34 42 50 58 

W^-^ -3 1119 27 35 43 51 59 

^'JST'-' ' 2° 52 60. 

^'^2^.^'-" .loMy "-OS 13 21 29 3 7 45 53 61 

^'mr"" ' .11,^3^^^ * 06 14 22 30 38 46 54 62 

%'^2b!'' ' °^ '5 23 31 39 47 55 63 

B rO 
The resulting code transposes an 8-by-8 matrix using 25 cycles.
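
The shuffle-based transpose can be checked in portable C. The following is a minimal sketch, not the procedure's actual implementation: the names perfect_shuffle and transpose8x8_by_shuffle are hypothetical, and int16_t stands in for the int16 type used above. Each shuffle rotates the 6-bit element index by one bit position, so three shuffles exchange the row and column index fields.

```c
#include <stdint.h>

/* One perfect shuffle of n elements (n even): interleave the first
 * half of src with the second half. */
static void perfect_shuffle(const int16_t *src, int16_t *dst, int n)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {
        dst[2 * i]     = src[i];
        dst[2 * i + 1] = src[i + half];
    }
}

/* Transpose an 8-by-8 matrix stored in row-major order by applying
 * three perfect shuffles to its 64 elements. */
static void transpose8x8_by_shuffle(const int16_t *src, int16_t *dst)
{
    int16_t t0[64], t1[64];
    perfect_shuffle(src, t0, 64);
    perfect_shuffle(t0, t1, 64);
    perfect_shuffle(t1, dst, 64);
}
```

Filling src with 0..63 and calling transpose8x8_by_shuffle yields the order shown after the third shuffling, with dst[1] holding element 8.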
1-Dimensional Discrete Cosine Transform

Copyright (C) 1991, Thomas G. Lane.

#include "jinclude.h"

#define RIGHT_SHIFT(x,shft) ((x) >> (shft))

#define DCT_SCALE (ONE << LG2_DCT_SCALE)

/* In some places we shift the inputs left by a couple more bits */

#define LG2_OVERSCALE 2

#define OVERSCALE (ONE << LG2_OVERSCALE)

#define OVERSHIFT(x) ((x) <<= LG2_OVERSCALE)

/* Scale a fractional constant by DCT_SCALE */
#define FIX(x) ((INT32) ((x) * DCT_SCALE + 0.5))

/* Scale a fractional constant by DCT_SCALE/OVERSCALE */
/* Such a constant can be multiplied with an overscaled input */
/* to produce something that's scaled by DCT_SCALE */
#define FIXO(x) ((INT32) ((x) * DCT_SCALE / OVERSCALE + 0.5))

/* Descale and correctly round a value that's scaled by DCT_SCALE */
#define UNFIX(x) RIGHT_SHIFT((x) + (ONE << (LG2_DCT_SCALE-1)), LG2_DCT_SCALE)

/* Same with an additional division by 2, ie, correctly rounded UNFIX(x/2) */
#define UNFIXH(x) RIGHT_SHIFT((x) + (ONE << LG2_DCT_SCALE), LG2_DCT_SCALE+1)

/* Here are the constants we need */

/* COS_i_j is the cosine of i*pi/j, scaled by DCT_SCALE */

#define SIN_1_4 FIX(0.707106781)
#define COS_1_4 SIN_1_4

#define SIN_1_8 FIX(0.382683432)
#define COS_1_8 FIX(0.923879533)
#define SIN_3_8 COS_1_8
#define COS_3_8 SIN_1_8

#define SIN_1_16 FIX(0.195090322)
#define COS_1_16 FIX(0.980785280)
#define SIN_7_16 COS_1_16
#define COS_7_16 SIN_1_16

#define SIN_3_16 FIX(0.555570233)
#define COS_3_16 FIX(0.831469612)
#define SIN_5_16 COS_3_16
#define COS_5_16 SIN_3_16

/* OSIN_i_j is the sine of i*pi/j, scaled by DCT_SCALE/OVERSCALE */

/* OCOS_i_j is the cosine of i*pi/j, scaled by DCT_SCALE/OVERSCALE */

#define OSIN_1_4 FIXO(0.707106781)
#define OCOS_1_4 OSIN_1_4

#define OSIN_1_8 FIXO(0.382683432)
#define OCOS_1_8 FIXO(0.923879533)
#define OSIN_3_8 OCOS_1_8
#define OCOS_3_8 OSIN_1_8

#define OSIN_1_16 FIXO(0.195090322)
#define OCOS_1_16 FIXO(0.980785280)
#define OSIN_7_16 OCOS_1_16
#define OCOS_7_16 OSIN_1_16

#define OSIN_3_16 FIXO(0.555570233)
#define OCOS_3_16 FIXO(0.831469612)
#define OSIN_5_16 OCOS_3_16
#define OCOS_5_16 OSIN_3_16



/*
 * Perform a 1-dimensional DCT.
 * Note that this code is specialized to the case DCTSIZE = 8.
 */

INLINE
LOCAL void
fast_dct_8 (DCTELEM *in, int stride)
{
  /* many tmps have nonoverlapping lifetime -- flashy register colourers
   * should be able to do this lot very well
   */
  INT16 in0, in1, in2, in3, in4, in5, in6, in7;
  INT16 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  INT16 tmp10, tmp11, tmp12, tmp13;
  INT16 tmp14, tmp15, tmp16, tmp17;
  INT16 tmp25, tmp26;

  in0 = in[       0];
  in1 = in[stride  ];
  in2 = in[stride*2];
  in3 = in[stride*3];
  in4 = in[stride*4];
  in5 = in[stride*5];
  in6 = in[stride*6];
  in7 = in[stride*7];

  tmp0 = in7 + in0;
  tmp1 = in6 + in1;
  tmp2 = in5 + in2;
  tmp3 = in4 + in3;
  tmp4 = in3 - in4;
  tmp5 = in2 - in5;
  tmp6 = in1 - in6;
  tmp7 = in0 - in7;

  tmp10 = tmp3 + tmp0;
  tmp11 = tmp2 + tmp1;
  tmp12 = tmp1 - tmp2;
  tmp13 = tmp0 - tmp3;

  in[       0] = (DCTELEM) UNFIXH((tmp10 + tmp11) * SIN_1_4);
  in[stride*4] = (DCTELEM) UNFIXH((tmp10 - tmp11) * COS_1_4);

  in[stride*2] = (DCTELEM) UNFIXH(tmp13*COS_1_8 + tmp12*SIN_1_8);
  in[stride*6] = (DCTELEM) UNFIXH(tmp13*SIN_1_8 - tmp12*COS_1_8);

  tmp16 = UNFIXO((tmp6 + tmp5) * SIN_1_4);
  tmp15 = UNFIXO((tmp6 - tmp5) * COS_1_4);

  OVERSHIFT(tmp4);
  OVERSHIFT(tmp7);

  /* tmp4, tmp7, tmp15, tmp16 are overscaled by OVERSCALE */

  tmp14 = tmp4 + tmp15;
  tmp25 = tmp4 - tmp15;
  tmp26 = tmp7 - tmp16;
  tmp17 = tmp7 + tmp16;



/*
 * Perform the forward DCT on one block of samples.
 *
 * A 2-D DCT can be done by 1-D DCT on each row
 * followed by 1-D DCT on each column.
 */

GLOBAL void
j_fwd_dct (DCTBLOCK data)
{
  int i;

  for (i = 0; i < DCTSIZE; i++)
    fast_dct_8(data+i*DCTSIZE, 1);

  for (i = 0; i < DCTSIZE; i++)
    fast_dct_8(data+i, DCTSIZE);
}
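
The fixed-point scaling used by the DCT listing can be exercised on its own. The sketch below restates FIX, UNFIX and UNFIXH as functions; the value of LG2_DCT_SCALE is not shown in the listing, so 16 is an assumption made here for illustration, and the names fix, unfix and unfixh are hypothetical. Right-shifting a negative value is implementation-defined in C, so the sketch is only exercised with non-negative operands.

```c
#include <stdint.h>

/* Assumed value: the listing does not show LG2_DCT_SCALE. */
#define LG2_DCT_SCALE 16
#define ONE ((int32_t)1)
#define DCT_SCALE (ONE << LG2_DCT_SCALE)

/* Scale a fractional constant by DCT_SCALE, rounding to nearest. */
static int32_t fix(double x)
{
    return (int32_t)(x * DCT_SCALE + 0.5);
}

/* Descale a value that is scaled by DCT_SCALE, rounding to nearest. */
static int32_t unfix(int32_t x)
{
    return (x + (ONE << (LG2_DCT_SCALE - 1))) >> LG2_DCT_SCALE;
}

/* Descale with an additional division by 2, rounding to nearest. */
static int32_t unfixh(int32_t x)
{
    return (x + (ONE << LG2_DCT_SCALE)) >> (LG2_DCT_SCALE + 1);
}
```

Under this assumption, fix(0.707106781) is 46341, and a sample value of 100 multiplied by that constant descales to 71 with unfix and to 35 with unfixh.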

The assembler code for the above procedure, called with stride=8, is as follows:
_fast_dct_8: 



L12B.I r4.r2.0 
•-■128.1 r6.r2.16 



inO = in[ 0}; 
# ml = in[stride 1; 

[.Mil '^.-nh^K * in2 = in(stride-2]: 

r10.r2.48 # in3 = in(stride-3]: 

L. 2B. r12.r2.64 # in4 = in(smde-4 : 

r14.r2.80 # inS = infstride'S]; 

L. 28. r16.r2.96 # in6 = in[strjde-6): 

n Anr> ,c M8.r2.112 U in7 = inlstrideV ; 

^ARn^f r20.r18.f4 # tmpO = in7 * inO: 

rSn r r22.n6 r6 # tmpi = in6 + inl: 

^aRo,! f24.rl4.r8 # tmp2 = inS + in2; 

rim ^R * tmp4 = in3-in4: 

rlnnlR * tmp5 » 102 - inS: 

?iul M r32.r6.rl6 # lmp6 = inl - in6: 

^ !RS ]! r34.r4.r18 # Imp7 = inO - in7: 

g JKr k '^n-1o '?2 « tmpn = tmp2 + tmpi: 

rina ! ^^0-f22.f24 # tmpi2 = tmpi - tmp2; 

r !nn - r42.f20.r26 # tmpi3 = tmpO - tmp3; 

r■MM?Anr^,. r48.r36.r38 # = tmpiO + tmpll 

G.MULADD.16 r44.r48.SSIN 1 4 S32768 
6.MULADD.16 r46.r49.SSIN"r4 $32768 
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f^iLn"^""' a Cerberus neruork may. for specificacion purposes be 

assigned a unique ne:work number, ir. the ranBe 0 ?16 i Tk«c» purposes dc 

never appear directly in Cerbern. c^^ZV P"^ net^-ork numbers 

>ar8« and mim.or nawork addrc 1 a tramacuons. so long as ,he 

net.-orlc add„ss=s mns, sa,4 Thc Sdons ' """^ """" 



targec[ :t iniriatorj 
targeuj ;^ targetj 
initiatori - initatorj, for all i 9t 



bandwidth of the L ewav neri" ''1"'' > ^""^ °^ Performance or 

or more transacrions ^vVe ted t^ f^"^^-*^' °- 

which they are initiated h^:'^^^^^^^^ ^"^""^ - 

focal Cibems tHrsfbte of 'r"'"^^^ ^ gateway on a 

is non-zero, the Seway whicrwe uJlS 5. 'P^f^^^'^A ""'"^er. If this byte 
this transaction' the gteJavl^^^^^ "'T^ ""^^ '^^^^ 

b., and Ji^i^^';^:^^- iriritrbt 

nlJr ' '''^"T re-transmitted on the target bu the ne iork 

i55'^°4°^FF~— ^^^^^ 

K„T f u**' °^ mformation through the gateway network Tctween 
Jinsm&Ll tt eS^rf^i^e^Ls ^^J^ /ateway^ mus/L dX 

that packet aborts on the ^^tlZs'lir.l:'^^^^^^^^^ 

Sn'tlfeT'' ""l '^'^ "'^r ' '^^"^ which rLpS ust baLly 

withm the time-out hmit on the target bus does not cause a ume-out on The 

Sn ITassuVd""^' ' ^""'^ °" unrthircond"it;on 

tr?ddrSSrSro:'*'" ? °" ^^'^ '"get bus (which may be from either 

the addressed target or some time-out generator), the packet is carried in the
reverse direction by the gateway network. This response and any further packets
are carried until the end of the transaction. The contents of the response and
further packets are not changed by the gateway network.

I^Zrl ^"""r' 'f'" '^"'''"^ ' ^'''^'■'■^y' is carried bv 

Repeater

in^r'!'"? ""'^ "'^"^^1 by inserting repeaters. A repeater eiectricallv 
separates two segments of a Cerberus bus, but provides a transparent linkage
between these two segments. Using a repeater is advantageous when the

e.Tr' Y °' •^"'t '^r\ ^^^^""^ ^^^-'"^ °n a large bus wodd 

dZ« ' ^•J^ ensure diat device 

addresses remain unique across what is logically a single serial bus. 




Generally, a repeater will repeat each request packet seen on one side of the
repeater on the other side, with a delay of at least one clock cycle. If two
transactions appear nearly simultaneously on each side of the repeater, the
repeater must abort one of the transactions and permit the other to be repeated.
This arbitration must be performed fairly, such as by alternating which side of the
repeater is preferred on consecutive collisions.
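
The alternating-preference rule can be sketched in a few lines of C. This is an illustration only, not a description of any actual repeater logic; the type and function names (repeater_arbiter, arbitrate) are hypothetical.

```c
#include <stdbool.h>

enum side { SIDE_A = 0, SIDE_B = 1 };

struct repeater_arbiter {
    enum side preferred;   /* side that wins the next collision */
};

/* Decide which side's request is repeated. Returns the winning side,
 * or -1 if neither side has a request. On a collision the preferred
 * side wins and the preference alternates for the next collision. */
static int arbitrate(struct repeater_arbiter *arb, bool req_a, bool req_b)
{
    if (req_a && req_b) {
        enum side winner = arb->preferred;
        arb->preferred = (winner == SIDE_A) ? SIDE_B : SIDE_A;
        return (int)winner;
    }
    if (req_a) return SIDE_A;
    if (req_b) return SIDE_B;
    return -1;
}
```

Only collisions change the preference, so a side that loses one collision is guaranteed to win the next one.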

A simple repeater continues until the end of the transaction by repeating the
response packets, which may appear on the same or opposite side as the original
request packet of the transaction.

If the topology of the Cerberus is constructed so that only target devices exist on
one side of the repeater, the design may be simplified by the elimination of the
arbitration function. In such a case, transactions may only originate from the side
designated to contain initiator-capable devices.

A more sophisticated repeater may "learn" which addresses are on each side of
the repeater, and only repeat transactions which need to cross the repeater to be
completed. Alternatively, a repeater may be constructed with knowledge of the
addresses on each side, such as addresses 0..127 on one side and
addresses 128..255 on the other, again permitting the selective repeating of
packets across the repeater.
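
A repeater built with fixed knowledge of the address split needs only a comparison to decide whether a request must cross. The sketch below assumes an 8-bit device address with 128..255 on one side and the complementary range 0..127 on the other; the names bus_side, side_of_address and must_repeat are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum bus_side { LOW_SIDE, HIGH_SIDE };

/* Side on which an 8-bit device address lives, under the fixed split
 * 0..127 on one side and 128..255 on the other. */
static enum bus_side side_of_address(uint8_t addr)
{
    return (addr < 128) ? LOW_SIDE : HIGH_SIDE;
}

/* A request seen on `seen_on` needs to cross the repeater only when
 * its target address is on the other side. */
static bool must_repeat(enum bus_side seen_on, uint8_t target_addr)
{
    return side_of_address(target_addr) != seen_on;
}
```

A "learning" repeater would replace side_of_address with a table filled in as responses are observed.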

Synchronous Repeater

A very simple form of repeater may be employed to divide up the capacitive and
leakage load on the SD signal of a Cerberus bus into two or more segments, when
a common SC clock reference is used.





(Figure: synchronous repeater block, with signals SD and SC.)

Synchronous repeater

Synchronous repeater implementation



SlilT network, this repeater improves performance by dividing up the RC 
delay by a factor of „. though two bus settling periods now occur on each^SC clock 



period, so the speedup is approximately j. 
This circuit can be economically implemented using a single TTL '621 part and a
pull-up resistor:



(Figure: a '621 part with enables GBA and GAB and ports A1..A8 and B1..B8, signals SC and SD, and a 500 Ω pull-up resistor.)

Eight-port synchronous repeater
Icarus Interprocessor Protocol

MicroUnity's Icarus interprocessor protocol uses Hermes high-bandwidth
channels to connect Terpsichore processors together, either directly or through
external switching components, permitting the construction of shared-memory,
coherently- or incoherently-cached multiprocessors. Icarus uses Hermes in the
Dual-Master Pair configuration, and can be extended for use in Multiple-
Master Ring configurations.

Internal daemons within Terpsichore perform and respond to Hermes write
operations, in which the Icarus interprocessor communication protocol is
embedded. These daemons provide for the generation of memory references to
remote processors, for access to Terpsichore's local physical memory space, and
for the transport of remote references to other remote processors.

Interprocessor Topologies

The simplest multiprocessor configuration that can be built with the Icarus
protocols is a dual-processor:




The diagram below represents the same dual-processor system, in a simpler
notation:



Dual-processor Terpsichore with Icarus interprocessor link 



In the configuration above, a pair of Hermes channels are connected together to 
form an Icarus Interprocessor link in the Dual-Master Pair configuration. A 
Cerberus bus connects all the system components together to facilitate system 
configuration. The Terpsichore processors all run off of a common frequency
clock, as required by the Hermes channels that connect between processors.

Dual Terpsichore processors with dual Icarus links may use both links to enhance
system bandwidth:



Dual-processor Terpsichore with Icarus interprocessor link



A Terpsichore processor's dual Icarus links, each in the Dual-Master Pair
configuration, may connect to two different processors. Using the Icarus
Transponder daemons in each processor, several processors may be
interconnected into a linear network of arbitrary size:




The Icarus links may also join at the ends of the linear network, forming a ring of
arbitrary size.




In the configuration above, two Icarus links are connected to each Terpsichore 
processor, forming a single ring. 
By connecting Icarus links into 4-master rings, and providing Hermes master
forwarding for responses using the Icarus Transponder daemons in each
processor, processors may be interconnected into a two-dimensional network of
arbitrary size:




Sixteen-processor Terpsichore with Icarus interprocessor links



?k!c.rk "5""^"^}°"^ '°P°'°eies can be constructed by using multimaster rings 
?^,n.f 1 U blocks An n-master ring („<4) of Terpsichore processors has n 

rlVfT. , f'^^'V '""'^''^l^ for connection into dual-master or multi-master 
configurations. For example, with a 4-master ring: 
These building blocks can then be assembled into radix-n switching networks:

By connecting Icarus links to external switching devirpc n,„lr:,, • . 

We ..ber OF processors can be co.....Jl^-,^;:^^:^^ 




&"9ht-processor Terpsichore with I carus links to switrhinr^ f.h..> 

Link-level and Transaction-level Protocol

Icarus uses the Hermes protocol at the link level, and uses Hermes operations to
embed a transaction-level protocol.

Two-packet Link-action Nomenclature
I Request ' 
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These actions and their relation to the data flow is shown below: 



(Figure: the Requester issues a Request to the Responder, which returns a Response.)

Link-action actions and data flow



Four-packet Transaction Nomenclature
We designate the term "transaction" to describe the upper-level packet protocol.

Transactions are used when the latency of a transaction may require that more 
than eight actions are outstanding at a point in time, in order to maintain the 
desired throughput of the protocol. Embedding the transaction protocol above the
link-level protocol reduces the link-level state which must be implemented.

Certain of the packets that make up a transaction contain an eight-bit transaction
identifier, or tid, which permits up to 256 outstanding transactions to be in
progress at any point in time. These packets also contain link-action identifiers,
or lids, which connect these packets with others which are part of the transaction
but do not contain a tid.

Transactions consist of four actions. Each action results in one or more link-level 
Hermes packets transmitted on the channels: 



    Request        the action taken by a requester to start the transaction
    Indication     the reception of a request by a responder
    Response       the action taken by the responder to finish the transaction
    Confirmation   the reception of the response by the requester
These actions and their relation to the data flow are shown below:

(Figure: the Requester issues a Request, received by the Responder as an Indication; the Responder issues a Response, received by the Requester as a Confirmation.)

Transaction actions and data flow



The table below shows the relationship between transaction-level actions and
link-level actions, showing typical transaction messages and link-action commands:



Transaction- 
level action 


Typical transaction 
messaqe 


Link-level 
actions 


Link-action command 


Request 
Indication 


read/write-sizelet- 
request 

Remote-Indication 


Request i write-octlet 


Response 
Confirmation 


read/write-sizelet- 
response 

RpmntA-rnnf irm o + 1 


Request 


write-response 
write-octlet 




Transaction protocol for Icarus Requester Daemon
Icarus Action Format

Request and Response actions

A series of link-level write octlet operations comprise an Icarus request or
response action. The address of the write operation contains target routing,
transaction-id, command and sequence information in the following format:

A remote request is a write octlet to an address of the form:

    31            16 15        8 7         0
    |     node      |    tid    |    com   |
           16             8           8

with data of the form:

    63                                     0
    |               octlet                 |
                      64


The tid field contains an 8-bit transaction id code which must be returned along
with the remote response. The tid field value must be unique among all
outstanding transactions.

The com field contains an 8-bit command code which, in the first octlet, designates
the operation to be performed in a request action or the result returned in a
response action. If the command code is in the range 0..31, in successive octlets
the value of the com field indicates the number of octlets to follow (0..9),
such that the last octlet of a message contains a com field with a 0 value.

The node field contains a 16-bit node address which is the target of the action.
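
The address layout can be illustrated with a small packing routine. This is a sketch under the field positions given above (node in bits 31..16, tid in bits 15..8, com in bits 7..0); the function names are hypothetical, and the reading that each octlet after the first carries the count of octlets still to follow is an interpretation of the com field description rather than a confirmed encoding.

```c
#include <stdint.h>

/* Pack the node, tid, and com fields into the 32-bit write address. */
static uint32_t icarus_pack_address(uint16_t node, uint8_t tid, uint8_t com)
{
    return ((uint32_t)node << 16) | ((uint32_t)tid << 8) | (uint32_t)com;
}

static uint16_t icarus_node(uint32_t addr) { return (uint16_t)(addr >> 16); }
static uint8_t  icarus_tid(uint32_t addr)  { return (uint8_t)(addr >> 8); }
static uint8_t  icarus_com(uint32_t addr)  { return (uint8_t)addr; }

/* com value for octlet `index` (0-based) of a message of `count`
 * octlets: the first octlet carries the command code, and each later
 * octlet carries the number of octlets still to follow, so the last
 * octlet of a multi-octlet message carries 0. */
static uint8_t icarus_com_for_octlet(uint8_t command, int index, int count)
{
    if (index == 0)
        return command;
    return (uint8_t)(count - 1 - index);
}
```

For a ten-octlet message with command code 52, the com values run 52, 8, 7, and so on down to 0 in the last octlet.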
When embedded into a link-level write octlet operation, the Terpsichore
requester daemon request appears on the Hermes in the form:



    | ma | 2 | lid   |
    | com            |
    | tid            |
    | node 7..0      |
    | node 15..8     |
    | octlet 63..56  |
    | octlet 55..48  |
    | octlet 47..40  |
    | octlet 39..32  |
    | octlet 31..24  |
    | octlet 23..16  |
    | octlet 15..8   |
    | octlet 7..0    |
    | check          |



A transaction which has a payload of one octlet must use a link-level write octlet
operation; larger payloads successively use link-level write octlet operations to
transmit the payload.

Indication and Confirmation actions

Indication and Confirmation actions consist of a series of link-level write octlet
response packets, one for each octlet of the Request and Response action.

Icarus Requester Daemon

""""P" ' '^^l °' * physical address in which the 

Reau^ter Da.^^^^^^ °^ ' remote Terps.chore processor. The Icarus 

Requester Daemon is an autonomous unit which attempts to satisfy such remote
memory references by communicating with an external device, either another
Terpsichore processor or a switching device which eventually reaches another
Terpsichore processor.

LH'rL'n7°'l- k^"''°l" characterized by an eight-byte physical bvie 

processor nod specitying a local physical address on that 
The Icarus Requester Daemon tags each remote memory reference with a
transaction tag of eight bits, permitting up to 256 such remote references
to be outstanding at any time; however, implementation limits within Terpsichore
may set a smaller bound.
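
The bookkeeping implied by an eight-bit tag can be sketched as a small allocator. This is an illustration only: the structure, the function names and the configurable limit are hypothetical, and nothing here describes how Terpsichore actually tracks outstanding references.

```c
#include <stdbool.h>
#include <stdint.h>

#define TID_COUNT 256  /* an eight-bit tag names at most 256 references */

struct tid_table {
    bool in_use[TID_COUNT];
    int  outstanding;
    int  limit;          /* implementation bound, 1..256 (hypothetical) */
};

static void tid_table_init(struct tid_table *t, int limit)
{
    for (int i = 0; i < TID_COUNT; i++)
        t->in_use[i] = false;
    t->outstanding = 0;
    t->limit = (limit < 1) ? 1 : (limit > TID_COUNT ? TID_COUNT : limit);
}

/* Returns a free tag (0..255), or -1 if the bound has been reached. */
static int tid_alloc(struct tid_table *t)
{
    if (t->outstanding >= t->limit)
        return -1;
    for (int i = 0; i < TID_COUNT; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            t->outstanding++;
            return i;
        }
    }
    return -1;
}

/* Release a tag once the matching response has been received. */
static void tid_free(struct tid_table *t, uint8_t tid)
{
    if (t->in_use[tid]) {
        t->in_use[tid] = false;
        t->outstanding--;
    }
}
```

A tag is reused only after it is freed, which keeps every outstanding reference uniquely identified.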

Jn^J^.';n"!t^^'''""'"u^''r" '?''"r^\' °f Transaction Requester, and 
an external device takes the role of the Transaction Responder. The daemon 
generates writes to a specified byte-channel and module address, which cause an
external device to read or write remote octlets or cache lines in a remote memory.

dS:&XTir^^^^^^^^ - — - - 

Icarus Responder Daemon

The Icarus Responder Daemon accepts writes from a specified Hermes channel and
module address, which enable an external device to generate transaction requests
to read or write octlets or cache lines in the Terpsichore's local memory or to
IZeTr ^VA^'^'" '^^ link-level uSteT'to tl^e 

Sgrt^Tvt^Thrny^x^ -^^^^ - — - - 

l"<:nnn5rl t^'^u '"i' ''^^ Transaction Requester, and the Icarus 

Responder takes the role of the Transaction Responder. 

Icarus Transponder Daemon

The Icarus Transponder Daemon accepts writes from a specified Hermes
channel and module address, which enable an external device to cause Icarus
to generate a request on another Hermes channel and module address.

Terpsichore contains two such transponder daemons which act concurrently 
(back-to-back) between two different byte-channel and/or module addresses. 

58 The term "sequence number" is avoided here, because the transaction-tags are not necessarily
sequential in nature.

59 The number of link-level requests to be outstanding is still under study.
Icarus Request

The following table summarizes the commands used for Icarus requests and
responses (response commands shown in bold):



    code      command                                                payload (octlets)
    0         last octlet of multi-octlet command
    1..9      continuation octlet of multi-octlet command
    10..19    Reserved
    20        Reserved
    21        read/add/swap octlet response                          1
    22        read incoherent weak cache-line                        1
    23        write response                                         4
    24        read allocate strong octlet                            1
    25        read noallocate strong octlet                          1
    26        read allocate weak octlet                              1
    27        read noallocate weak octlet                            1
    28        read allocate strong hexlet                            1
    29        read noallocate strong hexlet                          1
    30        read allocate weak hexlet                              1
    31        read noallocate weak hexlet                            1
    32        read hexlet response                                   2
    33        read incoherent cache-line response                    8
    34        read coherent cache-line response                      9
    35..36    Reserved
    37        read coherent strong cache-line                        2
    38        Reserved
    39        read coherent weak cache-line
    40..51    Reserved
    52        write coherent strong cache-line                       10
    53        write incoherent strong cache-line                     9
    54        write coherent weak cache-line                         10
    55        write incoherent weak cache-line                       9
    56        write allocate strong octlet                           2
    57        write noallocate strong octlet                         2
    58        write allocate weak octlet                             2
    59        write noallocate weak octlet                           2
    60        write allocate strong hexlet                           3
    61        write noallocate strong hexlet                         3
    62        write allocate weak hexlet                             3
    63        write noallocate weak hexlet                           3
    64        add-and-swap allocate strong octlet little-endian      2
    65        add-and-swap noallocate strong octlet little-endian    2
    66        add-and-swap allocate weak octlet little-endian        2
    67        add-and-swap noallocate weak octlet little-endian      2
    68..79    Reserved
    80        add-and-swap allocate strong octlet big-endian         2
    81        add-and-swap noallocate strong octlet big-endian       2
    82        add-and-swap allocate weak octlet big-endian           2
    83        add-and-swap noallocate weak octlet big-endian         2
    84        compare-and-swap allocate strong octlet                3
    85        compare-and-swap noallocate strong octlet              3
    86        compare-and-swap allocate weak octlet                  3
    87        compare-and-swap noallocate weak octlet                3
    88        multiplex-and-swap allocate strong octlet              3
    89        multiplex-and-swap noallocate strong octlet            3
    90        multiplex-and-swap allocate weak octlet                3
    91        multiplex-and-swap noallocate weak octlet              3
    92        multiplex allocate strong octlet                       3
    93        multiplex noallocate strong octlet                     3
    94        multiplex allocate weak octlet                         3
    95        multiplex noallocate weak octlet                       3
    96..255   reserved

Icarus Request commands



A remote {add,swap,or,and} octlet request is data of the form:

    | address       |
    | bytes 0..7    |
           64

A remote read incoherent {strong,weak} cache-line request is data of the form:

    | address       |
           64

A remote read coherent {strong,weak} cache-line request is data of the form:

    | address       |
    | coherence tag |
           64

A remote write incoherent cache-line request is data of the form:

    | address       |
    | bytes 0..7    |
    | bytes 8..15   |
    | bytes 16..23  |
    | bytes 24..31  |
    | bytes 32..39  |
    | bytes 40..47  |
    | bytes 48..55  |
    | bytes 56..63  |
           64

A remote write coherent cache-line request is data of the form:

    | address       |
    | coherence tag |
    | bytes 0..7    |
    | bytes 8..15   |
    | bytes 16..23  |
    | bytes 24..31  |
    | bytes 32..39  |
    | bytes 40..47  |
    | bytes 48..55  |
    | bytes 56..63  |
           64

A remote read {allocate,noallocate} {strong,weak} octlet request is data of the
form:

    | address       |
           64

A remote write {allocate,noallocate} {strong,weak} octlet request is data of the
form:

    | address       |
    | bytes 0..7    |
           64

A remote read {allocate,noallocate} {strong,weak} hexlet request is data of the
form:

    | address       |
           64

A remote write {allocate,noallocate} {strong,weak} hexlet request is data of the
form:

    | address       |
    | bytes 0..7    |
    | bytes 8..15   |
           64
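
The request formats are simple sequences of octlets, which the following sketch assembles for the two cache-line write requests. The function names are hypothetical, and the sketch assumes each row of a format is one 64-bit octlet sent in the order listed.

```c
#include <stddef.h>
#include <stdint.h>

/* Payload of a remote write coherent cache-line request: address,
 * coherence tag, then the 64-byte line as eight octlets.
 * Returns the number of octlets written (10). */
static size_t build_write_coherent_line(uint64_t out[10], uint64_t address,
                                        uint64_t coherence_tag,
                                        const uint64_t line[8])
{
    size_t n = 0;
    out[n++] = address;
    out[n++] = coherence_tag;
    for (size_t i = 0; i < 8; i++)
        out[n++] = line[i];
    return n;
}

/* Incoherent variant: address followed by the eight data octlets.
 * Returns the number of octlets written (9). */
static size_t build_write_incoherent_line(uint64_t out[9], uint64_t address,
                                          const uint64_t line[8])
{
    size_t n = 0;
    out[n++] = address;
    for (size_t i = 0; i < 8; i++)
        out[n++] = line[i];
    return n;
}
```

The octlet counts returned here, 10 and 9, agree with the payload sizes listed for the cache-line write commands.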



Icarus Indication

coniains the lid valut If ,1,, I L I °f ''"k-'^'l »Tite rcsponse packet 

level tequests andr,rann,;i '"P""'' 'l" ^W"y to receive .ddiiional link- 
»bU.tv ,1 teTetJe' dS;^i=^jL;tX",„°e.^'°^' °' "^^ 

cp.n,„d, „d tid ,„fot„»„„, ^-.d IVdtTrio'nt^e^dTr™";^^^^^ 



0 



1..9 



ia.22 



23 



24..31 



command 



termination 



continuation 



Reserved 



wnte response 



Reserved 



read/add/swap octlet response 



payload 
{octlet s) 



34 
35 



read hextet respons"e 



read incoherent cache-Une responle " 



7-255 



read coherent cache-line response 
reserved 



3_ 

9 

10 



Icarus Response codes 
The com field contains an 8-bit response command, as given in the table
previously.

The tid field contains the 8-bit transaction id code used in the request message.

The node field contains the 16-bit processor number used in the request message.

A remote {read,add,swap} octlet response is data of the form:

    | bytes 0..7    |
           64

A remote read hexlet response is data of the form:

    | bytes 0..7    |
    | bytes 8..15   |
           64

A remote read incoherent cache-line response is data of the form:

    | bytes 0..7    |
    | bytes 8..15   |
    | bytes 16..23  |
    | bytes 24..31  |
    | bytes 32..39  |
    | bytes 40..47  |
    | bytes 48..55  |
    | bytes 56..63  |
           64



A remote write response is data of the form: 

63 

I 5- 



64 
A remote read coherent cache-line response is data of the form:

    | coherence tag |
    | bytes 0..7    |
    | bytes 8..15   |
    | bytes 16..23  |
    | bytes 24..31  |
    | bytes 32..39  |
    | bytes 40..47  |
    | bytes 48..55  |
    | bytes 56..63  |
           64

A remote write coherent cache-line response is data of the form:

    | coherence tag |

Icarus Confirmation

An Icarus Confirmation consists of a link-level write-response packet for each
link-level write issued as an Icarus Response. Each link-level write-response
packet contains the lid value of the link-level write-request packet. This series
re°en.fatd.^nd ^.^"1°' ' and'.nd^ca'tmg the abmtT to 

lece oi of the^in ''i ^f''"""-'"'^ " transaction-level confirmation of 

^equfsts ^ '° additional transaction-level 

Deadlock 

The Icarus Requester, Responder, and Transponder daemons must act
cooperatively to avoid deadlock that may arise due to an imbalance of requests in
the system which prevents responses from being routed to their destination.

The requirements vary depending upon the characteristics of the system
configuration, and the mechanisms for deadlock avoidance are still under study.

Principal mechanisms to employ are cycle-free routing of requests, and the means
to prioritize responses above requests in forwarding priority.
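
Since the mechanisms are stated to be under study, the following is only a sketch of one of the two ideas named above, prioritizing responses above requests when forwarding. The queue-depth interface and the names forward_choice and choose_next are hypothetical.

```c
#include <stddef.h>

enum forward_choice { FORWARD_NONE, FORWARD_RESPONSE, FORWARD_REQUEST };

/* Pick what to forward next: any queued response goes ahead of any
 * queued request, so requests cannot starve responses of bandwidth. */
static enum forward_choice choose_next(size_t queued_responses,
                                       size_t queued_requests)
{
    if (queued_responses > 0)
        return FORWARD_RESPONSE;
    if (queued_requests > 0)
        return FORWARD_REQUEST;
    return FORWARD_NONE;
}
```

Draining responses first lets outstanding transactions complete and free their resources before new requests are admitted.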

Error handling 

The link-level packets contain a check byte which is designed to detect single-bit
transmission errors in the Hermes channel.






When either party in an Icarus transaction receives a packet with a check error, it
immediately shuts down input processing to avoid encountering further errors as
may arise from errors which disrupt the parsing of packets. It also generates an
error packet, which ensures that the other party is notified of the error.

The Target of an Icarus transaction must maintain a copy of the link-level address
of the most recent correctly received link-level write operation in a Cerberus
register. Terpsichore then will clear the error using the Cerberus channel,
resetting the Hermes input processing. Each party then re-issues any outstanding
link-level transactions.

'^.^'rZTi "^t'"' ^''ii^ ^'"^-'^^-^^ P^^'^'^^l ^'^^ ensure chat 

the error handling mechanism does not result in missing or repeated operations. 

protocol contains non-idempotent operations.

Fixed-point Applications

Finding Most-Significant One

The following example performs a "find most-significant one" operation on
general register ra, placing the result in general register rb.

    E.ALMS  rb,ra

The following example performs a "find least-significant one" operation on
general register ra, placing the result in general register rb.

    E.ADDI  rb,ra,-1
    E.ANDN  rb,rb,ra
    E.ALMS  rb,rb
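
The three-instruction sequence above can be sketched in portable C. This is a minimal illustration under stated assumptions, not the instruction definitions: the helper names msb_index and lsb_index are hypothetical, and the exact result encoding of E.ALMS is assumed here to be a bit index. The identity it relies on is that (x - 1) & ~x leaves ones only in the positions strictly below the least-significant one of x.

```c
#include <stdint.h>

/* Index of the most-significant set bit of x, or -1 when x == 0.
   Stands in for the "most-significant one" step; the hardware
   instruction's result encoding is an assumption. */
int msb_index(uint64_t x) {
    int n = -1;
    while (x != 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Index of the least-significant set bit of x (64 when x == 0). */
int lsb_index(uint64_t x) {
    /* subtract one, then AND with the complement of x:
       ones remain only strictly below the lowest set bit */
    uint64_t below = (x - 1) & ~x;
    /* the highest of those ones sits one position below the answer */
    return msb_index(below) + 1;
}
```

For example, lsb_index(40) examines binary 101000 and reports bit 3, while msb_index(40) reports bit 5.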

Floating-point Applications

The following example demonstrates the inner loop of the Linpack benchmark.
This section is under construction.

Digital Signal Processing Applications

This section is under construction.

Image Processing Applications

The following examples demonstrate several applications, listed below in summary
form with the performance estimated. The estimates assume single-cycle loads
and stores; that is, they do not account for losses due to cache misses. However,
[...] prefetching, they could [...].
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-i;^ ■ r •Kf\^> if\ jit 

5-point Horizontal Decimation of Monochrome Image
5-point Vertical Decimation of Monochrome Image
3x3 Decimation of Monochrome Image
5-point Horizontal Decimation



1.0 



0.9 



1.3 



5-coint Vertical. Decimation ot Co 



r Co or Imaoe 



3x3 Decimation of Color Imaoe 



!ICf 



Image 



0.5 



0.2 



0.3 



Filtering of Monochrome Image

    k00 k01 k02
    k10 k11 k12
    k20 k21 k22

void MonochromeFilter(int8 *src, int8 *dst, int row, int pcount,
        int8 k00, int8 k01, int8 k02,
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[i] = (src[i-row-1]*k00 + src[i-row]*k01 + src[i-row+1]*k02 +
                  src[i-1]*k10 + src[i]*k11 + src[i+1]*k12 +
                  src[i+row-1]*k20 + src[i+row]*k21 + src[i+row+1]*k22)>>8;
    }
}

We now examine the assembler coding of the inner loop. Because there are eight
pixels in an octlet, the input size of the multiplier, this loop filters eight pixels at
once. The coefficients are placed in 9 registers symbolically named k00..k22, a
copy of the coefficient in each byte of the register. We assume that the coefficients
are scaled so that the sum of products does not overflow a 16-bit integer accumulation;
specifically, the sum of the absolute values of the coefficients <= [...]. The
[...] registers 8, 9, and 10 contain [...].



1:  A.SUB          r2,r8,row
    L.64           r3,-1(r2)
    G.MUL.8        r4,r3,k00
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k01,r4
    L.64           r3,1(r2)
    G.MULADD.8     r4,r3,k02,r4
    L.64           r3,-1(r8)
    G.MULADD.8     r4,r3,k10,r4
    L.64           r3,0(r8)
    G.MULADD.8     r4,r3,k11,r4
    L.64           r3,1(r8)
    G.MULADD.8     r4,r3,k12,r4
    A.ADD          r2,r8,row
    L.64           r3,-1(r2)
    G.MULADD.8     r4,r3,k20,r4
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k21,r4
    L.64           r3,1(r2)
    G.MULADD.8     r4,r3,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,8
    A.ADD          r9,8
    B.NE           r8,r10,1b

With some obvious reordering of the address computation instructions, this can
run in 10 cycles, assuming single-cycle latency for G.MULADD. Loop unrolling
[...] latency. The inner loop is 10 cycles per eight pixels,
or 0.8 pixels/cycle. Counting each multiply as 8 operations and each multiply and
add as 16 operations, we are running at 8+8*16=136 operations/loop / 10
cycles/loop = 13.6 operations/cycle.

Note that our design actually loads each pixel nine times, which is making good
use of excess load bandwidth and data caching.
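
The eight-pixels-at-once structure of the inner loop can be modelled in portable C. The sketch below is illustrative only: the names filter8 and muladd8 are hypothetical, pixels and coefficients are assumed to be unsigned 8-bit values, and each lane is assumed to accumulate in 16 bits before the final shift by 8. It performs nine eight-lane reads per group of eight output pixels, which mirrors the remark that each pixel is loaded nine times.

```c
#include <stdint.h>

enum { LANES = 8 };

/* One multiply-add step across eight lanes: acc[l] += px[l] * coeff. */
static void muladd8(uint16_t acc[LANES], const uint8_t *px, uint8_t coeff) {
    for (int l = 0; l < LANES; l++)
        acc[l] = (uint16_t)(acc[l] + px[l] * coeff);
}

/* Filter eight adjacent pixels starting at src and write eight results
   to dst. row is the image row stride in bytes. The caller must make
   sure one row above, one row below, and one pixel on each side of
   the eight-pixel group are readable. */
void filter8(const uint8_t *src, uint8_t *dst, int row,
             const uint8_t k[3][3]) {
    uint16_t acc[LANES] = {0};
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            muladd8(acc, src + dy * row + dx, k[dy + 1][dx + 1]);
    /* keep the high byte of each 16-bit sum, discarding low precision */
    for (int l = 0; l < LANES; l++)
        dst[l] = (uint8_t)(acc[l] >> 8);
}
```

With coefficients that sum to 256, a uniform image passes through unchanged and no 16-bit lane can overflow, since 255 * 256 is below 65536.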

Filtering of Color Image

For a color image, we assume that the image is made up of pixels each 32 bits in
size, 8 bits for each of red, green, blue, and alpha. We treat each component
identically, so the same algorithm is used, but the offsets change slightly. A C
version of the code is:

void ColorFilter(int8 *src, int8 *dst, int row, int pcount,
        int8 k00, int8 k01, int8 k02,
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i;
    for (i=0; i!=4*pcount; i++) {
        dst[i] = (src[i-row-4]*k00 + src[i-row]*k01 + src[i-row+4]*k02 +
                  src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 +
                  src[i+row-4]*k20 + src[i+row]*k21 + src[i+row+4]*k22)>>8;
    }
}

The assembler coding of the inner loop is:

1:  A.SUB          r2,r8,row
    L.64           r3,-4(r2)
    G.MUL.8        r4,r3,k00
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k01,r4
    L.64           r3,4(r2)
    G.MULADD.8     r4,r3,k02,r4
    L.64           r3,-4(r8)
    G.MULADD.8     r4,r3,k10,r4
    L.64           r3,0(r8)
    G.MULADD.8     r4,r3,k11,r4
    L.64           r3,4(r8)
    G.MULADD.8     r4,r3,k12,r4
    A.ADD          r2,r8,row
    L.64           r3,-4(r2)
    G.MULADD.8     r4,r3,k20,r4
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k21,r4
    L.64           r3,4(r2)
    G.MULADD.8     r4,r3,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,8
    A.ADD          r9,8
    B.NE           r8,r10,1b



Conversion of Monochrome to Color

void MonochromeToColor(int8 *src, int8 *dst, int pcount) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[4*i] = src[i];
        dst[4*i+1] = src[i];
        dst[4*i+2] = src[i];
        dst[4*i+3] = 255;
    }
}

1: L.64.B r4.0(r8) 

G.SHUFFLE.16 r2 r4 r4 

G.SHUFFLE.16 r8 r4'f'; ' « c 

G.SHUFFLE.8 [t ill ^ 

G.SHUFFLE.8 r8'r3r9 

S.128.B r6.0(r9) 

S-^28.B r8.16(r9) 

A.ADD rS.S 

A.ADD f932 

BNE rsino.lb 
The above sequence is 4 cycles per 8 pixels, or 2.0 pixels/cycle.

void MonochromeWithAlphaToColor(int8 *src, int8 *alpha, int8 *dst, int pcount) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[4*i] = src[i];
        dst[4*i+1] = src[i];
        dst[4*i+2] = src[i];
        dst[4*i+3] = alpha[i];
    }
}

Which results in the following inner loop:



1:  L.64.B         r4,0(r8)
    L.64.B         r5,0(r9)
    G.SHUFFLE.16   r2,r4,r4
    G.SHUFFLE.16   r3,r4,r5
    G.SHUFFLE.8    r6,r2,r8
    G.SHUFFLE.8    r8,r3,r9
    S.128.B        r6,0(r10)
    S.128.B        r8,16(r10)
    A.ADD          r8,8
    A.ADD          r9,8
    A.ADD          r10,32
    B.NE           r8,r11,1b



The above sequence is 4 cycles per 8 pixels, or 2.0 pixels/cycle. 
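
The expansion from monochrome plus alpha to four-byte pixels can be pictured as repeated interleaving. The following C sketch is an assumption-laden model rather than the instruction semantics: interleave and mono_alpha_to_color8 are hypothetical names, and the order in which byte-wide and two-byte-wide interleaves are applied was chosen for clarity and need not match the G.SHUFFLE sequence above.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleave n groups from a and b, each group `width` bytes wide:
   out = a0 b0 a1 b1 ... */
void interleave(const uint8_t *a, const uint8_t *b, uint8_t *out,
                size_t n, size_t width) {
    for (size_t i = 0; i < n; i++)
        for (size_t w = 0; w < width; w++) {
            out[(2 * i) * width + w] = a[i * width + w];
            out[(2 * i + 1) * width + w] = b[i * width + w];
        }
}

/* Expand 8 monochrome pixels and 8 alpha values into 8 four-byte pixels. */
void mono_alpha_to_color8(const uint8_t mono[8], const uint8_t alpha[8],
                          uint8_t color[32]) {
    uint8_t mm[16], ma[16];
    interleave(mono, mono, mm, 8, 1);   /* m0 m0 m1 m1 ...             */
    interleave(mono, alpha, ma, 8, 1);  /* m0 a0 m1 a1 ...             */
    interleave(mm, ma, color, 8, 2);    /* m0 m0 m0 a0 m1 m1 m1 a1 ... */
}
```

Passing an array filled with 255 as alpha gives the behavior of the plain monochrome-to-color conversion.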
Conversion of Color to Monochrome

To convert a color image to a monochrome image, a weighted sum of the red,
green and blue components is generated. These weights, k0, k1, and k2, are
selected so that k0+k1+k2 = 256, so overflow does not occur. The resulting
weighted sum is truncated, rather than rounded, again, to avoid the possibility of
overflow.

void ColorToMonochrome(int8 *src, int8 *dst, int pcount, int8 k0, int8 k1, int8 k2) {
    int i;
    for (i=0; i!=pcount; i++) {
        dst[i] = (src[4*i]*k0 + src[4*i+1]*k1 + src[4*i+2]*k2)>>8;
    }
}

Which results in the following inner loop:



1:  L.128.B        r2,0(r8)
    G.DEAL.16      r2,r2,r3       # k0k1...k0k1k200...k200
    L.128.B        r4,16(r8)
    G.DEAL.16      r6,r4,r5       # k0k1...k0k1k200...k200
    G.DEAL.8       r2,r2,r6       # k0k0...k0k0k1k1...k1k1
    G.DEAL.8       r4,r3,r7       # k2k2...k2k20000...0000
    G.MUL.8        r6,r2,k0
    G.MULADD.8     r6,r3,k1,r6
    G.MULADD.8     r6,r4,k2,r6
    G.COMPRESS.16  r6,r6,8        # toss away low precision
    S.64           r6,0(r9)
    A.ADD          r8,32
    A.ADD          r9,8
    B.NE           r8,r10,1b

The above sequence is 8 cycles and writes 8 pixels, or 1.0 pixels/cycle.
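
The weighting scheme for color-to-monochrome conversion can be checked with a small C sketch. It is illustrative only: color_to_mono is a hypothetical name, the weights are passed as unsigned values so that a weight above 127 is representable, and the example weights 77, 150 and 29 are an assumption chosen because they sum to 256.

```c
#include <stdint.h>

/* Convert pcount four-byte pixels (red, green, blue, alpha) to one
   byte each. Weights are expected to satisfy k0 + k1 + k2 == 256, so
   the largest sum is 255 * 256 and the shifted result fits a byte. */
void color_to_mono(const uint8_t *src, uint8_t *dst, int pcount,
                   unsigned k0, unsigned k1, unsigned k2) {
    for (int i = 0; i < pcount; i++) {
        unsigned sum = src[4 * i] * k0
                     + src[4 * i + 1] * k1
                     + src[4 * i + 2] * k2;
        dst[i] = (uint8_t)(sum >> 8);   /* truncate, do not round */
    }
}
```

Because the sum is truncated rather than rounded, a pure white input (255, 255, 255) maps to exactly 255 and never wraps.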
    int i;
    for (i=0; i!=pcount; i++) {
        dst[i] = (src[4*i]*k0 + src[4*i+1]*k1 + src[4*i+2]*k2)>>8;
        alpha[i] = src[4*i+3];
    }
}

Which results in the following inner loop:



1:  L.128          r2,0(r8)
    G.DEAL.16      r2,r2,r3       # k0k1...k0k1k200...k200
    L.128          r4,16(r8)
    G.DEAL.16      r6,r4,r5       # k0k1...k0k1k200...k200
    G.DEAL.8       r2,r2,r6       # k0k0...k0k0k1k1...k1k1
    G.DEAL.8       r4,r3,r7       # k2k2...k2k20000...0000
    S.64           r5,0(r10)
    G.MUL.8        r6,r2,k0
    G.MULADD.8     r6,r3,k1,r6
    G.MULADD.8     r6,r4,k2,r6
    G.COMPRESS.16  r6,r6,8        # toss away low precision
    S.64           r6,0(r9)
    A.ADD          r8,32
    A.ADD          r9,8
    A.ADD          r10,8
    B.NE           r8,r10,1b

The above sequence is 8 cycles and writes 8 pixels, or 1.0 pixels/cycle.
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sphere, or drawn on'a^urface tha" s nlred ;, h r'spe"^^ ^^"""'^ ^ 

principal data structure used to generate lllu ^ ^ 
copies ot the imaee as shoun in rhT. u . " ^^J^" ^ °^ decimated 

because tnterpolSon of ^he cle4nt?7ther "^^"^ ?' ^^"'^"^^^ 
antiahased spatialJv-warped image Ce tL rK. r T ' P^^P"')' 

always exactly four^irncs larger L "^.e nrLin.r l"!' °i « 
the image decimated in ei her the x or v H ! ? ""'^u ^ "PX of 

and smaller gomg ngh^ and down m rl! The images get smaUer 

The originallmagVneed not t s" '"re L """^ ^ ^ ''"^'^ dot. 

structure. ^^'^ °' ^'^''^ ""^ t^'^t PO«^ers of two tor this 




In the sections below, we explore two pans of the problem, the creation of the 

table Th""""' '^'"'"'/^ ouZs I he 

table. These are the pans of the process which r:ust be performed in real tiiie fo^ 
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reaJ-time application of this process the 

be precomputed. and are a tunction of the renT;."^ '' "^'"1 ""'P' o"^" 

iiic renaering system used. 

J^rts^TJ^-:^!^^^^^^^^ ^bove C3n be dtvtded Into nvo 

direction only. The former venerates ^1^^. hi " u'"'^ decimating m the vertical 
nnage and the latter generafes :hTr\fJn. biocfe ? u^^'.""' t^' ^^^mal 
This divides the problem into two pTrt each u 5°'" '°P 

'.vhich ,s a great advantage because the amounrl^f- . ^"^ one-dimensional filtering. 

These weights, k0..k4, are selected so that k0+k1+k2+k3+k4 = 256, so overflow
does not occur. The resulting weighted sum is truncated, rather than rounded,
again, to avoid the possibility of overflow.

void HorizontalDecimationMonochrome(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-2]*k0 + src[i-1]*k1 + src[i]*k2 +
                        src[i+1]*k3 + src[i+2]*k4)>>8;
        }
        i += srow-drow-drow;
    }
}

Which results in the following inner loop:

1:  L.128          r6,-2(r8)
    G.DEAL.8       r6,r6,r7
    G.MUL.8        r4,r6,k0
    G.MULADD.8     r4,r7,k1,r4
    L.128          r6,0(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k2,r4
    G.MULADD.8     r4,r7,k3,r4
    L.128          r6,2(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k4,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b
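
A five-tap, 2:1 horizontal decimation of a single row can be sketched in portable C as follows. This is a minimal model under assumptions, not the listing's exact indexing: decimate_row is a hypothetical name, the weights are unsigned and expected to sum to 256, and the source index is stepped explicitly by two per output pixel.

```c
#include <stdint.h>

/* Produce dcount output pixels from one row. src points at the first
   center sample; two valid samples must exist on either side of every
   center that is read (centers are src[0], src[2], src[4], ...). */
void decimate_row(const uint8_t *src, uint8_t *dst, int dcount,
                  const unsigned k[5]) {
    for (int j = 0; j < dcount; j++) {
        const uint8_t *c = src + 2 * j;   /* every second input pixel */
        unsigned sum = c[-2] * k[0] + c[-1] * k[1] + c[0] * k[2]
                     + c[1] * k[3] + c[2] * k[4];
        dst[j] = (uint8_t)(sum >> 8);     /* truncate, do not round */
    }
}
```

With a symmetric kernel such as {16, 64, 96, 64, 16}, a linear ramp is reproduced exactly at the sampled centers, which makes a convenient sanity check.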

(For [...] the rate is 6 cycles per 8 pixels, or 1.3 pixels/cycle.)
When decimating in the vertical direction, the rate is even higher still:

void VerticalDecimationMonochrome(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-2*srow]*k0 + src[i-srow]*k1 + src[i]*k2 +
                        src[i+srow]*k3 + src[i+2*srow]*k4)>>8;
        }
        i += srow+srow-drow;
    }
}

Which results in the following inner loop:

1:  A.SUB          r2,r8,rowt2
    L.64           r3,0(r2)
    G.MUL.8        r4,r3,k0
    A.SUB          r2,r8,row
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k1,r4
    L.64           r3,0(r8)
    G.MULADD.8     r4,r3,k2,r4
    A.ADD          r2,r8,row
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k3,r4
    A.ADD          r2,r8,rowt2
    L.64           r3,0(r2)
    G.MULADD.8     r4,r3,k4,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0-8(r9)
    A.ADD          r8,8
    A.ADD          r9,8
    B.NE           r8,r10,1b



This runs in 6 cycles per 8 pixels, or 1.3 pixels/cycle. (For 3 pixels wide the rate is
4 cycles per 8 pixels, or 2 pixels/cycle.)

To generate the decimated array shown above, for an n^2 image, n^2 pixels are
generated in the horizontal direction, and 2n^2 pixels are generated in the vertical
direction. Using 5 pixel filter functions, this takes: n^2/0.9 + 2n^2/1.3 =
[...] cycles. [...] image [can] be decimated in 2.8 Mcycles.
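
The cost expression above is easy to evaluate numerically. The C sketch below is only a calculator for that expression: decimation_cycles is a hypothetical name, the rates 0.9 and 1.3 pixels/cycle are the ones quoted in the text, and the image size used in the example is an assumption.

```c
/* Cycles needed to build the decimated array for an n-by-n image:
   n*n pixels at the horizontal rate plus 2*n*n pixels at the vertical
   rate, with both rates given in pixels per cycle. */
double decimation_cycles(double n, double h_rate, double v_rate) {
    double n2 = n * n;
    return n2 / h_rate + 2.0 * n2 / v_rate;
}
```

The expression works out to roughly 2.65 n^2 cycles. Assuming n = 1024, decimation_cycles(1024, 0.9, 1.3) comes to a little under 2.8 million cycles, which is in line with the 2.8 figure quoted.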

It is also possible to simultaneously decimate in the vertical and horizontal
direction. While this may be more expensive than separately decimating in each
direction, it permits the use of filter functions which do not factor into two parts.
For this example, we assume a 2:1 decimation rate in each direction, and a 3x3
filter kernel. Real applications of decimation may use larger filter kernels, but this
[...] to illustrate the techniques used. We assume here that pcount is a
multiple of drow, and that drow<srow/2.

void DecimateMonochrome(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k00, int8 k01, int8 k02,
        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {

        }
        i += 2*(srow-drow);
    }
}

Assembler code for inner loop:

1:  A.SUB          r2,r8,srow
    L.128          r6,-1(r2)
    G.DEAL.8       r6,r6,r7
    G.MUL.8        r4,r6,k00
    G.MULADD.8     r4,r7,k01,r4
    L.128          r6,1(r2)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k02,r4
    L.128          r6,-1(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k10,r4
    G.MULADD.8     r4,r7,k11,r4
    L.128          r6,1(r8)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k12,r4
    A.ADD          r2,r8,srow
    L.128          r6,-1(r2)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k20,r4
    G.MULADD.8     r4,r7,k21,r4
    L.128          r6,1(r2)
    G.DEAL.8       r6,r6,r7
    G.MULADD.8     r4,r6,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b



unused. „d for ^^vt "„3 ; of pi^d lSr„r=T„ldt V?"" ''5 
Our first example is the one-dimensional horizontal filter [...]
specified by coefficients k0..k4 [...]
selected so that k0+k1+k2+k3+k4 = 256 [...]
weighted sum is truncated, rather than rounded, again, to avoid the possibility of
overflow.

void HorizontalDecimationColor(int8 *src, int8 *dst, int srow, int drow, int pcount,
        int8 k0, int8 k1, int8 k2, int8 k3, int8 k4) {
    int i,j,k;

    for (k=0,i=0; k!=pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8;
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8;
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8;
            dst[k++] = (src[i-8]*k0 + src[i-4]*k1 + src[i]*k2 +
                        src[i+4]*k3 + src[i+8]*k4)>>8;
        }
        i += 4*(srow+srow-drow-drow);
    }
}

Which results in the following inner loop:

1:  L.128          r6,-8(r8)
    G.DEAL.32      r6,r6,r7
    G.MUL.8        r4,r6,k0
    G.MULADD.8     r4,r7,k1,r4
    L.128          r6,0(r8)
    G.DEAL.32      r6,r6,r7
    G.MULADD.8     r4,r6,k2,r4
    G.MULADD.8     r4,r7,k3,r4
    L.128          r6,8(r8)
    G.DEAL.32      r6,r6,r7
    G.MULADD.8     r4,r6,k4,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b

s^^^'^ixek i-de^^Fn; ' T'^' "u"' P'-l^^'^y^^. -hen the fUter kerx^el 
Ss/cyde r ^' ^ 'J^'^^^^ P" 2 pixels, or 0.3 

When decimating in the vertical direction, the rate is even higher still:
vo.d Ven.ca:De„ s .st. .nt srow. int .row. in, pcoun.. 
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'Cf (k=C.t=G. k^=pccunt: ) | 
• ?cr (J=0. j' = 4'afovv; . 

src(i^4-5rcwj-k3 . src[.+3-3rcw]-k4 )»6 
I ^ = a •( src'.v - src w-d ro;% i . 

t 

Which results in the following inner loop:

1 A.SUB r2.fS rov/'9 

1-64 r3.0(r2) 

G.MUL.3 r4 r3 kO 

A-SUB r2.r3.rc.vK 

L-S-* r3.0(r2) 

G.MULADD.8 r4.r3kVr4 

'-•6'' f3.0(r3) 

G.MULADD.3 r4 r'' ^2 '4 

'^ADD r2.r6:'cw!^ 
r3.0(f2) 

G.MULA0D.8 r4.r3 k3 r4 

AADD r2.r8.ro'.vt3 
r3.0(r2) 

G.MULA0D.8 r4.r3 k4 t4 

G. COMPRESS.'. 6 r4 r4"8 ' 

S64 f4.0-8(r9) 

oNE r8.rl0.1b 

To generate the decimated array shown above, for an n^2 image, n^2 pixels are
[...] decimated in 11 Mcycles.

kernel. Real appbcations of decimation may use larger filter kernels bur rhic , ,! 

        int8 k10, int8 k11, int8 k12,
        int8 k20, int8 k21, int8 k22) {
    int i,j,k;

    for (k=0,i=0; k!=4*pcount; ) {
        for (j=0; j!=drow; j++) {
            dst[k++] = (src[i-4*srow-4]*k00 + src[i-4*srow]*k01 + src[i-4*srow+4]*k02 +
                        src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 +
                        src[i+4*srow-4]*k20 + src[i+4*srow]*k21 + src[i+4*srow+4]*k22)>>8;
            dst[k++] = (src[i-4*srow-4]*k00 + src[i-4*srow]*k01 + src[i-4*srow+4]*k02 +
                        src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 +
                        src[i+4*srow-4]*k20 + src[i+4*srow]*k21 + src[i+4*srow+4]*k22)>>8;
            dst[k++] = (src[i-4*srow-4]*k00 + src[i-4*srow]*k01 + src[i-4*srow+4]*k02 +
                        src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 +
                        src[i+4*srow-4]*k20 + src[i+4*srow]*k21 + src[i+4*srow+4]*k22)>>8;
            dst[k++] = (src[i-4*srow-4]*k00 + src[i-4*srow]*k01 + src[i-4*srow+4]*k02 +
                        src[i-4]*k10 + src[i]*k11 + src[i+4]*k12 +
                        src[i+4*srow-4]*k20 + src[i+4*srow]*k21 + src[i+4*srow+4]*k22)>>8;
        }
        i += 4*(srow+srow-drow-drow);
    }
}

Assembler code for inner loop:

1:  A.SUB          r2,r8,srow
    L.128          r6,-4(r2)
    G.DEAL.32      r6,r6
    G.MUL.8        r4,r6,k00
    G.MULADD.8     r4,r7,k01,r4
    L.128          r6,4(r2)
    G.DEAL.32      r6,r6
    G.MULADD.8     r4,r6,k02,r4
    L.128          r6,-4(r8)
    G.DEAL.32      r6,r6
    G.MULADD.8     r4,r6,k10,r4
    G.MULADD.8     r4,r7,k11,r4
    L.128          r6,4(r8)
    G.DEAL.32      r6,r6
    G.MULADD.8     r4,r6,k12,r4
    A.ADD          r2,r8,srow
    L.128          r6,-4(r2)
    G.DEAL.32      r6,r6
    G.MULADD.8     r4,r6,k20,r4
    G.MULADD.8     r4,r7,k21,r4
    L.128          r6,4(r2)
    G.DEAL.32      r6,r6
    G.MULADD.8     r4,r6,k22,r4
    G.COMPRESS.16  r4,r4,8
    S.64           r4,0(r9)
    A.ADD          r8,16
    A.ADD          r9,8
    B.NE           r8,r10,1b

After some reordering of the address calculation instructions, the inner loop is 16
cycles per 2 pixels, or 0.12 pixels/cycle.

Fractional Interpolation

This section is under construction.

Image Compression Applications

The following examples demonstrate key portions [...]
Discrete Cosine Transform (DCT) to transform raster-image data
into a frequency-based representation that is more amenable to entropy coding.
[...] The estimates assume single-cycle loads
and stores; that is, they do not account for losses due to cache misses. However,
[...]



                                                                  cycles per
                                                                  pixel
Internal 8x8 Matrix Transpose                                     0.4
1-D Fixed-point 8-point Discrete Cosine Transform                 1.0
2-D Fixed-point 8-by-8 Discrete Cosine Transform                  2.8
1-D Floating-point 8-point Discrete Cosine Transform              0.6
2-D Floating-point 8-by-8 Discrete Cosine Transform               1.9
2-D Fixed-point 8-by-8 Discrete Cosine Transform for JPEG         2.3
2-D Floating-point 8-by-8 Discrete Cosine Transform for JPEG      1.4

perform a second, identical DCT, then transpose the matrix again.

This example details the transposition of an 8-by-8 matrix of 16-bit values stored
consecutively in memory. The calculation is performed entirely in registers, using
G.SHUFFLE instructions and a technique described in [60], in which the first and
second halves of the matrix are shuffled log2 N times.

Assume the matrix originally is in the order:

     0  1  2  3  4  5  6  7
     8  9 10 11 12 13 14 15
    16 17 18 19 20 21 22 23
    24 25 26 27 28 29 30 31
    32 33 34 35 36 37 38 39
    40 41 42 43 44 45 46 47
    48 49 50 51 52 53 54 55
    56 57 58 59 60 61 62 63

[60] Stone, Harold. "Parallel Processing with the Perfect Shuffle," IEEE Transactions on
Computers, Vol. C-20, No. 2, February 1971, 153.

After one shuffling, the matrix is in the order:

     0 32  1 33  2 34  3 35
     4 36  5 37  6 38  7 39
     8 40  9 41 10 42 11 43
    12 44 13 45 14 46 15 47
    16 48 17 49 18 50 19 51
    20 52 21 53 22 54 23 55
    24 56 25 57 26 58 27 59
    28 60 29 61 30 62 31 63

After a second shuffling, the matrix is in the order:

     0 16 32 48  1 17 33 49
     2 18 34 50  3 19 35 51
     4 20 36 52  5 21 37 53
     6 22 38 54  7 23 39 55
     8 24 40 56  9 25 41 57
    10 26 42 58 11 27 43 59
    12 28 44 60 13 29 45 61
    14 30 46 62 15 31 47 63

After a third shuffling, the matrix is in the order:

     0  8 16 24 32 40 48 56
     1  9 17 25 33 41 49 57
     2 10 18 26 34 42 50 58
     3 11 19 27 35 43 51 59
     4 12 20 28 36 44 52 60
     5 13 21 29 37 45 53 61
     6 14 22 30 38 46 54 62
     7 15 23 31 39 47 55 63



C code for procedure:

void Matrix8By8Transpose(int16 *src, int16 *dst) {
    int16 tm0[64];
    int16 tm1[64];
    int i;

    for (i=0; i<32; i++) { tm0[2*i] = src[i]; tm0[2*i+1] = src[i+32]; }
    for (i=0; i<32; i++) { tm1[2*i] = tm0[i]; tm1[2*i+1] = tm0[i+32]; }
    for (i=0; i<32; i++) { dst[2*i] = tm1[i]; dst[2*i+1] = tm1[i+32]; }
}
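
The claim that log2 N perfect shuffles transpose an N-by-N matrix can be checked directly. The sketch below is a self-contained model with hypothetical names (perfect_shuffle, shuffle_transpose), limited by assumption to N being a power of two no larger than 8. Each shuffle rotates the element index left by one bit, so log2 N shuffles exchange the row bits with the column bits.

```c
#include <stdint.h>
#include <string.h>

/* One perfect shuffle of `count` elements (count even):
   interleave the first half with the second half. */
void perfect_shuffle(const int16_t *in, int16_t *out, int count) {
    int half = count / 2;
    for (int i = 0; i < half; i++) {
        out[2 * i] = in[i];
        out[2 * i + 1] = in[i + half];
    }
}

/* Transpose an n-by-n matrix stored row by row, using log2(n)
   perfect shuffles. n must be a power of two and at most 8. */
void shuffle_transpose(const int16_t *src, int16_t *dst, int n) {
    int16_t a[64], b[64];
    int count = n * n;
    memcpy(a, src, (size_t)count * sizeof a[0]);
    for (int s = 1; s < n; s *= 2) {      /* runs log2(n) times */
        perfect_shuffle(a, b, count);
        memcpy(a, b, (size_t)count * sizeof a[0]);
    }
    memcpy(dst, a, (size_t)count * sizeof a[0]);
}
```

Filling src with 0..63 and calling shuffle_transpose(src, dst, 8) reproduces the three intermediate orders tabulated above, ending with dst[r*8+c] equal to src[c*8+r].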

Assembler code for procedure:

_Matrix8By8Transpose:
    L.128.I        r4,r2,0        # 00 01 02 03 04 05 06 07
    L.128.I        r12,r2,64      # 32 33 34 35 36 37 38 39
    G.SHUFFLE.8    r20,r4,r12     # 00 32 01 33 02 34 03 35
    L.128.I        r6,r2,16       # 08 09 10 11 12 13 14 15
    G.SHUFFLE.8    r22,r5,r13     # 04 36 05 37 06 38 07 39
    L.128.I        r14,r2,80      # 40 41 42 43 44 45 46 47
    G.SHUFFLE.8    r24,r6,r14     # 08 40 09 41 10 42 11 43
    L.128.I        r8,r2,32       # 16 17 18 19 20 21 22 23
    G.SHUFFLE.8    r26,r7,r15     # 12 44 13 45 14 46 15 47
    L.128.I        r16,r2,96      # 48 49 50 51 52 53 54 55
    G.SHUFFLE.8    r28,r8,r16     # 16 48 17 49 18 50 19 51
    L.128.I        r10,r2,48      # 24 25 26 27 28 29 30 31
    G.SHUFFLE.8    r30,r9,r17     # 20 52 21 53 22 54 23 55
    L.128.I        r18,r2,112     # 56 57 58 59 60 61 62 63
    G.SHUFFLE.8    r32,r10,r18    # 24 56 25 57 26 58 27 59
    G.SHUFFLE.8    r34,r11,r19    # 28 60 29 61 30 62 31 63
    G.SHUFFLE.8    r4,r20,r28     # 00 16 32 48 01 17 33 49
    G.SHUFFLE.8    r6,r21,r29     # 02 18 34 50 03 19 35 51
    G.SHUFFLE.8    r8,r22,r30     # 04 20 36 52 05 21 37 53
    G.SHUFFLE.8    r10,r23,r31    # 06 22 38 54 07 23 39 55
    G.SHUFFLE.8    r12,r24,r32    # 08 24 40 56 09 25 41 57
    G.SHUFFLE.8    r14,r25,r33    # 10 26 42 58 11 27 43 59
    G.SHUFFLE.8    r16,r26,r34    # 12 28 44 60 13 29 45 61
    G.SHUFFLE.8    r18,r27,r35    # 14 30 46 62 15 31 47 63
    G.SHUFFLE.8    r20,r4,r12     # 00 08 16 24 32 40 48 56
    S.128.I        r20,r3,0
    G.SHUFFLE.8    r22,r5,r13     # 01 09 17 25 33 41 49 57
    S.128.I        r22,r3,16
    G.SHUFFLE.8    r24,r6,r14     # 02 10 18 26 34 42 50 58
    S.128.I        r24,r3,32
    G.SHUFFLE.8    r26,r7,r15     # 03 11 19 27 35 43 51 59
    S.128.I        r26,r3,48
    G.SHUFFLE.8    r28,r8,r16     # 04 12 20 28 36 44 52 60
    S.128.I        r28,r3,64
    G.SHUFFLE.8    r30,r9,r17     # 05 13 21 29 37 45 53 61
    S.128.I        r30,r3,80
    G.SHUFFLE.8    r32,r10,r18    # 06 14 22 30 38 46 54 62
    S.128.I        r32,r3,96
    G.SHUFFLE.8    r34,r11,r19    # 07 15 23 31 39 47 55 63
    S.128.I        r34,r3,112
    B              r0

The resulting code transposes an 8-by-8 matrix using 25 cycles.
#include "jinclude.h"

#define RIGHT_SHIFT(x,shft)  ((x) >> (shft))
#define LG2_DCT_SCALE 15     /* lose a little precision to avoid overflow */
#define ONE ((INT32) 1)
#define DCT_SCALE (ONE << LG2_DCT_SCALE)

/* In some places we shift the inputs left by a couple more bits, */
/* so that they can be added to fractional results without too much */
/* loss of precision. */

#define LG2_OVERSCALE 2
#define OVERSCALE (ONE << LG2_OVERSCALE)
#define OVERSHIFT(x) ((x) <<= LG2_OVERSCALE)

Copyright (C) 1991, Thomas G. Lane.

/* Scale a fractional constant by DCT_SCALE */
#define FIX(x) ((INT32) ((x) * DCT_SCALE + 0.5))

/* Scale a fractional constant by DCT_SCALE/OVERSCALE */
/* Such a constant can be multiplied with an overscaled input */
/* to produce something that's scaled by DCT_SCALE */
#define FIXO(x) ((INT32) ((x) * DCT_SCALE / OVERSCALE + 0.5))

/* Descale and correctly round a value that's scaled by DCT_SCALE */
#define UNFIX(x) RIGHT_SHIFT((x) + (ONE << (LG2_DCT_SCALE-1)), LG2_DCT_SCALE)

/* Same with an additional division by 2, i.e., correctly rounded UNFIX(x/2) */
#define UNFIXH(x) RIGHT_SHIFT((x) + (ONE << LG2_DCT_SCALE), LG2_DCT_SCALE+1)

/* Take a value scaled by DCT_SCALE and round to integer scaled by OVERSCALE */
#define UNFIXO(x) RIGHT_SHIFT((x) + (ONE << (LG2_DCT_SCALE-1-LG2_OVERSCALE)),\
                              LG2_DCT_SCALE-LG2_OVERSCALE)

/* Here are the constants we need */
/* SIN_i_j is sine of i*pi/j, scaled by DCT_SCALE */
/* COS_i_j is cosine of i*pi/j, scaled by DCT_SCALE */

#define SIN_1_4 FIX(0.707106781)
#define COS_1_4 SIN_1_4

#define SIN_1_8 FIX(0.382683432)
#define COS_1_8 FIX(0.923879533)
#define SIN_3_8 COS_1_8
#define COS_3_8 SIN_1_8

#define SIN_1_16 FIX(0.195090322)
#define COS_1_16 FIX(0.980785280)
#define SIN_7_16 COS_1_16
#define COS_7_16 SIN_1_16

#define SIN_3_16 FIX(0.555570233)
#define COS_3_16 FIX(0.831469612)
#define SIN_5_16 COS_3_16
#define COS_5_16 SIN_3_16

/* OSIN_i_j is sine of i*pi/j, scaled by DCT_SCALE/OVERSCALE */
/* OCOS_i_j is cosine of i*pi/j, scaled by DCT_SCALE/OVERSCALE */

#define OSIN_1_4 FIXO(0.707106781)
#define OCOS_1_4 OSIN_1_4

#define OSIN_1_8 FIXO(0.382683432)
#define OCOS_1_8 FIXO(0.923879533)
#define OSIN_3_8 OCOS_1_8
#define OCOS_3_8 OSIN_1_8

#define OSIN_1_16 FIXO(0.195090322)
#define OCOS_1_16 FIXO(0.980785280)
#define OSIN_7_16 OCOS_1_16
#define OCOS_7_16 OSIN_1_16

#define OSIN_3_16 FIXO(0.555570233)
#define OCOS_3_16 FIXO(0.831469612)
#define OSIN_5_16 OCOS_3_16
#define OCOS_5_16 OSIN_3_16



/*
 * Perform a 1-dimensional DCT.
 * Note that this code is specialized to the case DCTSIZE = 8.
 */

INLINE
LOCAL void
fast_dct_8 (DCTELEM *in, int stride)
{
  /* many tmps have nonoverlapping lifetime -- flashy register colourers
   * should be able to do this lot very well
   */
  INT16 in0, in1, in2, in3, in4, in5, in6, in7;
  INT16 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  INT16 tmp10, tmp11, tmp12, tmp13;
  INT16 tmp14, tmp15, tmp16, tmp17;
  INT16 tmp25, tmp26;

  in0 = in[       0];
  in1 = in[stride  ];
  in2 = in[stride*2];
  in3 = in[stride*3];
  in4 = in[stride*4];
  in5 = in[stride*5];
  in6 = in[stride*6];
  in7 = in[stride*7];

  tmp0 = in7 + in0;
  tmp1 = in6 + in1;
  tmp2 = in5 + in2;
  tmp3 = in4 + in3;
  tmp4 = in3 - in4;
  tmp5 = in2 - in5;
  tmp6 = in1 - in6;
  tmp7 = in0 - in7;

  tmp10 = tmp3 + tmp0;
  tmp11 = tmp2 + tmp1;
  tmp12 = tmp1 - tmp2;
  tmp13 = tmp0 - tmp3;

  in[       0] = (DCTELEM) UNFIXH((tmp10 + tmp11) * SIN_1_4);
  in[stride*4] = (DCTELEM) UNFIXH((tmp10 - tmp11) * COS_1_4);

  in[stride*2] = (DCTELEM) UNFIXH(tmp13*COS_1_8 + tmp12*SIN_1_8);
  in[stride*6] = (DCTELEM) UNFIXH(tmp13*SIN_1_8 - tmp12*COS_1_8);

  tmp16 = UNFIXO((tmp6 + tmp5) * SIN_1_4);
  tmp15 = UNFIXO((tmp6 - tmp5) * COS_1_4);

  OVERSHIFT(tmp4);
  OVERSHIFT(tmp7);

  /* tmp4, tmp7, tmp15, tmp16 are overscaled by OVERSCALE */

  tmp14 = tmp4 + tmp15;
  tmp25 = tmp4 - tmp15;
  tmp26 = tmp7 - tmp16;
  tmp17 = tmp7 + tmp16;

  in[stride  ] = (DCTELEM) UNFIXH(tmp17*OCOS_1_16 + tmp14*OSIN_1_16);
  in[stride*7] = (DCTELEM) UNFIXH(tmp17*OCOS_7_16 - tmp14*OSIN_7_16);
  in[stride*5] = (DCTELEM) UNFIXH(tmp26*OCOS_5_16 + tmp25*OSIN_5_16);
  in[stride*3] = (DCTELEM) UNFIXH(tmp26*OCOS_3_16 - tmp25*OSIN_3_16);
}

/*
 * Perform the forward DCT on one block of samples.
 *
 * A 2-D DCT can be done by 1-D DCT on each row
 * followed by 1-D DCT on each column.
 */

GLOBAL void
j_fwd_dct (DCTBLOCK data)
{
  int i;

  for (i = 0; i < DCTSIZE; i++)
    fast_dct_8(data+i*DCTSIZE, 1);

  for (i = 0; i < DCTSIZE; i++)
    fast_dct_8(data+i, DCTSIZE);
}
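
The scaling convention used by the constants in the listing above can be illustrated in isolation. The C sketch below is a simplified stand-in, not the listing's macros: the function names fix and unfix are hypothetical, only the 15-bit scale is modelled, and inputs are assumed non-negative so that the right shift and the +0.5 rounding behave as intended.

```c
#include <stdint.h>

#define LG2_SCALE 15
#define SCALE ((int32_t)1 << LG2_SCALE)

/* Convert a non-negative fractional constant to fixed point,
   rounding to the nearest integer. */
int32_t fix(double x) {
    return (int32_t)(x * SCALE + 0.5);
}

/* Remove the scale factor from a non-negative fixed-point product,
   rounding to the nearest integer. */
int32_t unfix(int32_t x) {
    return (x + (SCALE >> 1)) >> LG2_SCALE;
}
```

For instance, fix(0.707106781) gives 23170, and unfix(100 * fix(0.707106781)) gives 71, the rounded value of 100 times the sine of pi/4.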

The assembler code for the above procedure, called with stride=8, is as follows:

_fast_dct_8:
    L.128.I        r4,r2,0        # in0 = in[       0];
    L.128.I        r6,r2,16       # in1 = in[stride  ];
    L.128.I        r8,r2,32       # in2 = in[stride*2];
    L.128.I        r10,r2,48      # in3 = in[stride*3];
    L.128.I        r12,r2,64      # in4 = in[stride*4];
    L.128.I        r14,r2,80      # in5 = in[stride*5];
    L.128.I        r16,r2,96      # in6 = in[stride*6];
    L.128.I        r18,r2,112     # in7 = in[stride*7];
    G.ADD.16       r20,r18,r4     # tmp0 = in7 + in0;
    G.ADD.16       r22,r16,r6     # tmp1 = in6 + in1;
    G.ADD.16       r24,r14,r8     # tmp2 = in5 + in2;
    G.ADD.16       r26,r12,r10    # tmp3 = in4 + in3;
    G.SUB.16       r28,r10,r12    # tmp4 = in3 - in4;
    G.SUB.16       r30,r8,r14     # tmp5 = in2 - in5;
    G.SUB.16       r32,r6,r16     # tmp6 = in1 - in6;
    G.SUB.16       r34,r4,r18     # tmp7 = in0 - in7;
    G.ADD.16       r36,r26,r20    # tmp10 = tmp3 + tmp0;
    G.ADD.16       r38,r24,r22    # tmp11 = tmp2 + tmp1;
    G.SUB.16       r40,r22,r24    # tmp12 = tmp1 - tmp2;
    G.SUB.16       r42,r20,r26    # tmp13 = tmp0 - tmp3;
    G.ADD.16       r48,r36,r38    # = tmp10 + tmp11
    G.MULADD.16    r44,r48,$SIN_1_4,$32768
    G.MULADD.16    r46,r49,$SIN_1_4,$32768
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G.EXTRACT.1.16 
S.12S.I 
G. SUB. 16 
G.MULADD.15 
G MULADD 15 
G. EXTRACT 1.16 
S 12S.I 

G..VIULADD 16 
G MULA0D.16 
G MULADD.16 
G.MULADD.16 
G.EXTRACTJ.16 
S.129I 

G.WULADD.16 

G.MULADD 16 

G.MULADD 16 

G.MULADD. 16 

G.EXTRACT.M6 

S. 128.1 

G ADD.16 

G.MULADD 16 

G.MULADD. 16 

G.EXTRACT.1.16 

G.SUB.16 

G MULADD.16 

G.MULADD. 16 

G.EXTRACT.L16 

G.SHL.16 

G,SHL.16 

G.ADD 16 

G.SUB.16 

G.SUB.16 

G.ADD.16 

G.MULADD. 16 

G.MULADD 16 

G.MULADD. 16 

G. MULADD.16 

G.EXTRACT.L16 

S 128.1 

G.MULADD. 16 

G. MULADD.16 

G.MULADD.16 

G.MULADD.16 

G.EXTRACT.L16 

S.128J 

G.MULADD.16 

G.MULADD.16 

G.MULADD.16 

G.MULADD.16 

G.EXTRACT.f.16 

S. 128.1 

G.MULADD.16 

G.MULADD.16 

G MULADD.16 

G.MULADD. 16 

G.EXTRACT.1.16 

S.128.1 

R 



r^6 



f44 
f46 

r^d 
r46 
f44 
r^4 
r44 
r46 

r46, 
r44, 
r44 
r48 
r44 
r46 
f44 
r48 
r46 
r48, 
r46, 
r28, 
r34. 
r48 
r50 
r52 
r54 
r44, 
r46 
r44 
r46 
r44 
r44 
r44 
r46 

f44 

r46 
r44 
r44, 
r44. 
r46. 
r44 
r46 
r44 
r44 
r44, 
r46. 
r44 
r46 
r44 
r44 
rO 



-I 



J.- :)42. cc 
■f'-*3.SCG5_1 4 53276= 
.•--i.r4c ic 

-r2;64 s in.rstr;de"4| 
r42.SC05^i J.$3276S 
.r43.3COS,1^.?. 332763 
-G.SSlN_l_3.r44 
■ 4:.SSiN^; 3,r4o 
.r44 f^e,]S 

.r2.32 ^ in[str:de'2I = 
.r42.SSirg^i_8.332763 
r43.SSirLl^8.332753 
r40.S-CGS 1 8r44 
r-^LS-CCS.I 3.r46 
f44.r46. i6 

-^^.96 # tnfsrride'ol = .. 
.r32.r3C 4 = impo - impo 
.r48.SS;N_l_4.54096 
.r49.53iN_"l ^4.34096 
r44 r4e.i4 impio = . 

.r32.r30 4 = tmoo - tmp5 
r48.5COS_1_4.S4096 
r49.5COS_1 4.S4C96 
r46.r46.14 # tmpi5 = 



r28.2 
r34.2 
.r28.r45 
.r28.r46 
r34.r44 
.r34.r44 
.r54.$OCOS.1 
r55.$OCOS 1 



- 0VERSHlFT(tmp4) 
^ 0VERSHIFT(imp7) 
n tmpl4 = imp4 + tmpiS: 

# tmp25 = tmp4 • tmpiS: 

# tmp26 = imp7 - tmpi6. 

# tmpl7 = tmp7 + ifnpi6 
16.$32768 
16.S32753 



.r48.SOSrN 1 16 r44 
.r49.$OSlN_ri6.f46 
.i'44.r46. 16 

.r2.16 # infsinde] = 
.r54.$OCOS_7.16.S32768 
.r55.SOCOS_7^16.$32768 
.r48.S-OSlN 7_16r44 
.r49.S-OSlN^7 16f46 
.'■44.r46.16 

.r2.i12 # m[siride'7] = 
r52.SOCOS.5^l6.$32768 
r53.SOCOS_5_16.S32768 
.r50.$OSlN_5 16.r44 
.r51.$OSlN_5**16.r46 
.r44.r46.15 

.r2.80 # in(stride-51 = 
r52.$OCOS.3_16.S3276B 
r53.$OCOSj_16.S32768 
r50.$.OSIN.3 16.r44 
rSl.S-OSIN 3_16.f46 
r44.r46.16 

r2.48 # in[stnde-31 = 






[...] G.MULADD, 10 G.EXTRACT.I
and 2 G.SHL instructions, which can be scheduled in 64 cycles. This code [...]

The code for a 2-dimensional DCT uses the 1-dimensional DCT above, an 8x8
[...] 1-dimensional DCT, and a second 1-dimensional DCT. The
[...] operations which are performed between these steps can be
[...] inlining, so we can estimate the performance by counting
the Group instructions alone, which total to 2*64+2*24 or 176 cycles. The
2-dimensional DCT covers 64 pixels, which works out to a rate of 2.8 cycles/pixel.
An inverse DCT should have similar performance characteristics.

[...] performed using half-precision (16-bit) floating-point
[...] accumulation of intermediate terms is performed
[...] of G.MULADD instructions and
the G.EXTRACT.I instructions can be removed. Also, 10 of
the G.MULADD operations become simple G.MUL. Thus 8 1-dimensional DCTs
would use 10 GF.ADD, 10 GF.SUB, 10 GF.MUL, 3 GF.MULADD, 3 GF.MUL[...]
[...] cycles/pixel, and the 2-dimensional 8x8 DCT
uses [...] 120 cycles, or 1.9 cycles/pixel. An inverse DCT should have
similar performance characteristics.

Further enhancements when used in JPEG algorithm

Because the output of the DCT is scanned into a linear sequence of items, the
final transpose operation can easily be eliminated. This reduces the fixed-point
DCT cost to 2*64+24 = 152 cycles, or 2.4 cycles/pixel; the floating-point DCT cost
is reduced to 2*36+24 = 96 cycles, or 1.5 cycles/pixel.

The following section demonstrates that the transpose cost can be reduced to 16
cycles, by using a combination of memory loads and stores and the G.SHUFFLE
operations, producing a fixed-point DCT in 2*64 + 16 = 144 cycles, or 2.3
cycles/pixel, and floating-point DCT in 2*36 + 16 = 88 cycles, or 1.4 cycles/pixel.
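
The cycle estimates quoted for the 2-dimensional DCT all follow one pattern: two 1-dimensional passes plus some number of transposes, spread over 64 pixels. The C sketch below merely reproduces that arithmetic; the function names dct2d_cycles and cycles_per_pixel are hypothetical.

```c
/* Cycles for an 8x8 2-D DCT built from two 1-D passes (each pass
   covering all eight rows or columns) plus `transposes` transposes. */
int dct2d_cycles(int pass_cycles, int transposes, int transpose_cycles) {
    return 2 * pass_cycles + transposes * transpose_cycles;
}

/* An 8x8 block covers 64 pixels. */
double cycles_per_pixel(int cycles) {
    return cycles / 64.0;
}
```

With 64-cycle fixed-point passes and 36-cycle floating-point passes, dct2d_cycles gives 176 and 120 cycles with two 24-cycle transposes, 152 and 96 with one, and 144 and 88 when the single transpose costs 16 cycles. Dividing by 64 gives 2.75, 1.875, 2.375, 1.5, 2.25 and 1.375 cycles/pixel, which the text quotes to one decimal place.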

Other Matrix Applications

Internal 4x4 Matrix Transpose

This example details the transposition of a 4-by-4 matrix of 16-bit values stored
consecutively in memory. The calculation is performed entirely in registers, using
G.SHUFFLE instructions and a technique described by Stone, in which the first and
second halves of the matrix are shuffled log2 N times.

Assume the matrix originally is in the order:

     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15

After one shuffling, the matrix is in the order:

     0  8  1  9
     2 10  3 11
     4 12  5 13
     6 14  7 15

After a second shuffling, the matrix is in the order:

     0  4  8 12
     1  5  9 13
     2  6 10 14
     3  7 11 15



C code for procedure: 

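#include <stdint.h>
typedef int16_t int16;   /* 16-bit element type assumed by the listing */

void Matrix4By4Transpose(int16 *src, int16 *dst) {
    int16 tm0[16];
    int i;

    /* First shuffle: interleave the first and second halves of src. */
    for (i = 0; i < 8; i++) { tm0[2*i] = src[i]; tm0[2*i+1] = src[i+8]; }
    /* Second shuffle: interleave the two halves of the intermediate result. */
    for (i = 0; i < 8; i++) { dst[2*i] = tm0[i]; dst[2*i+1] = tm0[i+8]; }
}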

Assembler code for procedure:

SubMatrixTranspose:
    L.128.I      r4,r2,0      # 00 01 02 03 04 05 06 07
    L.128.I      r6,r2,16     # 08 09 10 11 12 13 14 15
    G.SHUFFLE.8  r8,r4,r6     # 00 08 01 09 02 10 03 11
    G.SHUFFLE.8  r10,r5,r7    # 04 12 05 13 06 14 07 15
    G.SHUFFLE.8  r4,r8,r10    # 00 04 08 12 01 05 09 13
    S.128.I      r4,r3,0
    G.SHUFFLE.8  r6,r9,r11    # 02 06 10 14 03 07 11 15
    S.128.I      r6,r3,16
    B            r0



The resulting code transposes a 4-by-4 matrix using 5 cycles. 
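
The register contents noted beside the assembler instructions can be reproduced with a portable C model. This is only a sketch: shuffle4 is a hypothetical helper that models the data movement shown in the listing's comments (interleaving two groups of four 16-bit values), not a definition of the G.SHUFFLE instruction, and the local array names merely echo the register names.

```c
#include <stdint.h>

/* Perfect shuffle of two 4-element groups: out = a0 b0 a1 b1 a2 b2 a3 b3.
 * Models the data movement shown for one shuffle step in the listing. */
static void shuffle4(const int16_t a[4], const int16_t b[4], int16_t out[8])
{
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* Transpose a row-major 4x4 matrix with four shuffles, mirroring the
 * register flow r4..r7 -> r8..r11 -> r4..r7 noted in the listing. */
static void transpose4x4_shuffle(const int16_t src[16], int16_t dst[16])
{
    int16_t r8[8], r10[8];
    shuffle4(src,     src + 8,  r8);       /* 00 08 01 09 02 10 03 11 */
    shuffle4(src + 4, src + 12, r10);      /* 04 12 05 13 06 14 07 15 */
    shuffle4(r8,      r10,      dst);      /* 00 04 08 12 01 05 09 13 */
    shuffle4(r8 + 4,  r10 + 4,  dst + 8);  /* 02 06 10 14 03 07 11 15 */
}
```

Calling transpose4x4_shuffle on the values 0 through 15 yields the rows 0 4 8 12, 1 5 9 13, 2 6 10 14 and 3 7 11 15, matching the matrix shown after the second shuffling.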

Stone, Harold. "Parallel Processing with the Perfect Shuffle," IEEE Transactions on
Computers, Vol. C-20, No. 2, February 1971, pp. 153-161.

A large matrix may not fit in the register file [...]
by modifying the code to [...] the row size of [...]
the matrix. Thus another useful extension of [...]



     0  1  2  3  4  5  6  7
     8  9 10 11 12 13 14 15
    16 17 18 19 20 21 22 23
    24 25 26 27 28 29 30 31
    32 33 34 35 36 37 38 39
    40 41 42 43 44 45 46 47
    48 49 50 51 52 53 54 55
    56 57 58 59 60 61 62 63

     0  8 16 24 32 40 48 56
     1  9 17 25 33 41 49 57
     2 10 18 26 34 42 50 58
     3 11 19 27 35 43 51 59
     4 12 20 28 36 44 52 60
     5 13 21 29 37 45 53 61
     6 14 22 30 38 46 54 62
     7 15 23 31 39 47 55 63

[...] extended to handle a 4x4 submatrix by splitting the L.128.I and S.128.I
instructions each into pairs of L.64.I and S.64.I instructions. The cost of the 4x4
[...] transpose. [...] submatrix can be faster than using the 8x8 submatrix
[...].
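
As a rough illustration of working on a 4x4 submatrix of a larger matrix, the following C sketch transposes one 4x4 tile whose rows are separated by a row stride, and then applies it tile by tile to an 8x8 matrix. It assumes row-major storage of 16-bit values; the function names and the stride parameters are hypothetical, and the sketch does not reproduce the instruction sequence or cycle counts of the original.

```c
#include <stddef.h>
#include <stdint.h>

/* Transpose one 4x4 tile of 16-bit values. src and dst point at the tile's
 * top-left element inside larger row-major matrices; the strides are row
 * lengths in elements (hypothetical parameter names). */
static void transpose_tile4x4(const int16_t *src, size_t src_stride,
                              int16_t *dst, size_t dst_stride)
{
    int16_t t[16];
    /* Gather four 4-element rows (each row is 64 bits of data). */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            t[4 * r + c] = src[r * src_stride + c];
    /* Write the gathered rows back as columns. */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[r * dst_stride + c] = t[4 * c + r];
}

/* Transpose an 8x8 row-major matrix tile by tile: tile (i,j) of the source
 * lands at tile (j,i) of the destination. src and dst must not overlap. */
static void transpose8x8_by_tiles(const int16_t src[64], int16_t dst[64])
{
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            transpose_tile4x4(src + 32 * i + 4 * j, 8,
                              dst + 32 * j + 4 * i, 8);
}
```

Passing the stride as a parameter lets the same tile routine serve matrices of any row size, which is the role the row size plays in the text.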

Shaded Graphics Applications

This section is under construction. 

Mnemosyne System Application

MicroUnity's Terpsichore system architecture uses nine Mnemosyne memory
devices in its base configuration, providing nine byte-wide paths between the
processor and memory. The memory devices are used to build a 0.5 Mbyte cache
between Terpsichore's first level caches and DRAM-based main memory. The
main memory store consists of 9, 18 or 36 banks of [...] (eighteen 4 Mbit
DRAMs), which yields 64, 128 or 256 Mbytes of [...] memory with
8, 16, or 32 Mbytes of directory storage.

[...] instructions to increment the src pointer by the row size between
each L instruction, taking no additional cycles.

[...] so that the same index can be
applied to the two pointers for the L and S instructions.



[Figure: Terpsichore memory system application. Labels: Terpsichore processor, 9 x Mnemosyne cache, DRAM array.]



[...] two or four Mnemosyne memory devices may be placed in each of the nine
byte-wide paths. Such configurations use 18, 36, 72 or 144 banks of [...]
of memory can be configured.

[...] a 64-byte cache line size. Each cache line is associated with an
octlet (8 bytes) of directory information, using one of the nine "Hermes channels"
[...] of the nine Hermes channels contain the eight octlets (eight byte units) of the
[...] to individual octlets of cache
data and directory information at maximum bandwidth, the directory information
is scattered evenly among eight of the nine byte lines.

Typical Cerberus configurations

The number of devices in a typical Cerberus [...]
about [...] devices ([...] Mnemosyne, 1 Terpsichore, [...] Calliope, [...]
FPGA) or about 48 ([...] Mnemosyne, [...] Terpsichore, [...]).




[Figure: Moderate Terpsichore system application, one maximally-configured processor per board.]

[Figure: Moderate Terpsichore system application, four minimally-configured processors per board. Labels: Terpsichore processor, FPGA.]

Cerberus performance



When determining the performance of Cerberus, [...] the
number of devices on the bus has a critical effect [...]
configurations with a resistive termination [...] estimated [...]

[...] result in a [...]

WE CLAIM:

1. An execution unit that maintains substantially peak data throughput
in the unified execution of multiple media data streams, the execution unit having
a data path, comprising:

a multi-precision arithmetic unit coupled to the data path, the multi-
precision arithmetic unit capable of dynamic partitioning based on the elemental
width of data received from the data path;

a switch coupled to the data path and programmable to manipulate
data received from the data path, the switch providing data streams to the data
path; and

an extended mathematical element coupled to the data path and
programmable to implement additional mathematical operations at substantially
peak data throughput.

2. The execution unit defined in claim 1, wherein the multi-precision
execution unit is configurable to divide the data into component symbols of various
sizes, analyze the component symbols based upon instructions, and re-synthesize
the component symbols for communication over the data path.

3. The execution unit defined in claim 2, wherein the multi-precision
execution unit is operable to perform unique operations on each component 
symbol. 

4. The execution unit defined in claim 2, wherein the mathematical 
element is operable to perform finite group, finite field, finite ring and table look-
up operations on the symbols.

5. The execution unit defined in claim 1, wherein the arithmetic unit is
programmable to perform Boolean, integer and floating point mathematical
operations.

6. The execution unit defined in claim 5, wherein the operations
performed by the arithmetic unit are capable of being performed at various levels
of precision.

7. The execution unit defined in claim 1, wherein the manipulation of
data comprises copying, shifting and re-sizing data. 

8. The execution unit defined in claim 1, further comprising control to
maximize use of the execution unit by performing operations at peak data width of 
the data path. 

9. The execution unit defined in claim 2, wherein the size of 
component symbols match. 

10. An execution unit having a data path, comprising: 
at least one register file coupled to the data path; 

a multi-precision arithmetic unit coupled to the data path, the multi- 
precision arithmetic unit capable of dynamic partitioning based on the elemental 
width of data received from the data path; 

a switch coupled to the data path and programmable to manipulate 
data received from the data path, the switch providing data streams to the data
path; and 

an extended mathematical element coupled to the data path and 
programmable to implement additional mathematical operations at substantially 
peak data throughput. 

11. An execution unit having a data path, comprising:

a multi-precision arithmetic unit coupled to the data path, the multi-
precision arithmetic unit capable of dynamic partitioning based on the elemental
width of data received from the data path;

means coupled to the data path for manipulating data received from
the data path, the means for manipulating data being programmable and providing
a data signal to the data path; and

an extended mathematical element coupled to the data path and
programmable to implement additional mathematical operations at substantially
peak data throughput.

12. A general purpose programmable media processor having an
instruction path and a data path to digitally process a plurality of media data
streams, comprising:

a high bandwidth external interface operable to receive a plurality of
data of various sizes from an external source and communicate the received data
over the data path at a rate that maintains substantially peak operation of the media
processor;

at least one register file configurable to receive and store data from
the data path and to communicate the stored data to the data path; and

a multi-precision execution unit coupled to the data path, the multi-
precision execution unit configurable to partition data received from the data path
to account for the elemental symbol size of the plurality of media data streams,
and programmable to operate on the data to generate a unified symbol output to
the data path.



13. The media processor defined in claim 12, wherein the execution unit 
is dynamically configurable to partition data received from the data path.

14. The media processor defined in claim 12, further comprising:

means for moving data between registers and memory by
performing load and store operations, and for coordinating the sharing of data
among a plurality of tasks by performing synchronization operations based upon
instructions and data received by the execution unit;

means for securely controlling the sequence of execution by
performing branch and gateway operations based upon instructions and data
received by the execution unit; and

a memory management unit, the memory management unit operable
to retrieve data and instructions for timely and secure communication over the data
path and instruction path.



15. The media processor defined in claim 14, further comprising:

a combined instruction cache and buffer, the combined instruction
cache and buffer dynamically allocated between cache space and buffer space to
ensure real-time execution of multiple media instruction streams; and

a combined data cache and buffer, the combined data cache and
buffer dynamically allocated between cache space and buffer space to ensure real-
time response for multiple media data streams.

16. The media processor defined in claim 15, wherein real-time
execution is ensured by dynamically allocating instruction buffer space to the
smallest and most frequently executed blocks of media instructions.

17. The media processor defined in claim 15, wherein real-time 
response is ensured by dynamically allocating data buffer space to the smallest and
most frequently accessed working sets of media data. 

18. The media processor defined in claim 12, wherein media data
streams comprise Nyquist sampled inputs and outputs.

19. The media processor defined in claim 12, wherein media data
streams originate from standard computer memory and I/O interfaces.

20. The media processor defined in claim 12, wherein the multi-
precision execution unit is configurable to divide the data into component symbols
of various sizes, analyze the component symbols based upon instructions, and re-
synthesize the component symbols for communication over the data path.

21. The media processor defined in claim 12, wherein the plurality of
media data streams comprise presentation media information, transmission media 
information, and storage media information. 

22. The media processor defined in claim 21, wherein presentation
media information comprises audio, video, image, and graphical information.

23. The media processor defined in claim 21, wherein transmission
media information comprises radio and network data transmissions.

24. The media processor defined in claim 21, wherein storage media
information comprises data encoded in moving and solid-state memory media.

25. The media processor defined in claim 12, wherein the width of the
data path is at least 128 bits.



26. The media processor defined in claim 12, wherein the multi-
precision execution unit comprises a dynamically partitionable arithmetic unit,
register controllable cross-bar switch, and an extended mathematical element.

27. The media processor defined in claim 24, wherein the register 
controllable cross-bar switch comprises a Benes network design. 

28. The media processor defined in claim 26, wherein the register
controllable cross-bar switch is programmable and is operable to manipulate
symbols.

29. The media processor defined in claim 22, wherein the extended
mathematical element is operable to perform finite group, finite field, finite ring
and table look-up operations on the symbols.

30. The media processor defined in claim 12, further comprising a set
of predefined instructions accessible by a user. 

31. The media processor defined in claim 13, wherein the means for
performing load, store, and synchronization operations and the means for 
performing branch and gateway operations comprises a set of predefined 
instructions accessible by a user. 

32. The media processor defined in claim 31, wherein the predefined
instructions are combinable to implement composite functions on the plurality of
media data streams.



33. A high bandwidth processor interface for receiving and transmitting 
a media stream, comprising: 

a data path, the data path operable to transmit media information at 
sustained peak rates; 

a plurality of memory controllers, the plurality of memory
controllers coupled to the data path in series to communicate stored media
information to and from the data path; and

a plurality of memory elements coupled to each of the plurality of
memory controllers in parallel, the plurality of memory elements for storing and
retrieving the media information.

34. The high bandwidth processor interface defined in claim 33,
wherein the data path comprises a plurality of data paths forming a high bandwidth
data channel.

35. The high bandwidth processor interface defined in claim 34,
wherein the high bandwidth data channel is uni-directional.

36. The high bandwidth processor interface defined in claim 33, further 
comprising a general purpose programmable media processor coupled to the high 
bandwidth data channel to receive, process and transmit media information at 
substantially peak rates. 

37. The high bandwidth processor interface defined in claim 33, 
wherein the peak rate of operation comprises at least one gigabyte of information 
per second from point to point. 

38. The high bandwidth processor interface defined in claim 33,
wherein the plurality of memory controllers each comprise a paired link disposed 
between each memory controller, the paired links each for transmitting and 
receiving plural bits of data and having differential data inputs and outputs and a 
differential clock signal. 

39. The high bandwidth processor interface defined in claim 38,
wherein the paired link further comprises a digital skew calibrator to adjust the
plural bits of data relative to the differential clock signal to eliminate skew
between the data.

40. The high bandwidth processor interface defined in claim 38,
wherein the paired link further comprises a phase locked loop to eliminate jitter in
the differential clock signal transmitted between paired links. 

41. The high bandwidth processor interface defined in claim 38,
wherein the plural bits comprise eight bits of data.

42. The high bandwidth processor interface defined in claim 38,
wherein the paired links each further comprise termination resistors to form
matched impedances for each paired link.

43. The high bandwidth processor interface defined in claim 34,
wherein the high bandwidth data channel comprises plural parallel high bandwidth
data channels.

44. A system for unified media processing comprising: 

a plurality of general purpose media processors, each media 
processor operable at sustained peak data rates and having a dynamically 
partitioned execution unit and a high bandwidth interface, the high bandwidth 
interface coupled to external memory and input/output elements to receive and
transmit data to the media processor at substantially peak rates; 

a bi-directional communication fabric, the plurality of media 
processors coupled to the bi-directional communication fabric to transmit and 
receive at least one media stream comprising presentation, transmission, and 
storage media information. 

45. The system defined in claim 44, wherein the bi-directional
communication fabric comprises a fiber optic network.

46. The system defined in claim 44, wherein the bi-directional
communication fabric comprises an heterogeneous network.

47. The system defined in claim 44, wherein the bi-directional
communication fabric comprises a coaxial cable network. 

48. The system defined in claim 44, wherein the bi-directional
communication fabric comprises a wireless network. 






49. The system defined in claim 44, wherein a subset of the plurality of
media processors comprise network servers. 

50. The system defined in claim 44, wherein the plurality of media
processors are programmable by downloading program information over the bi-
directional communication fabric.

51. The system defined in claim 44, wherein the each of the plurality of
media processors can access an idle execution unit of another media processor in a
shared manner to efficiently process presentation, transmission and storage media
information at substantially peak data rates.

52. The system defined in claim 44, wherein each media processor
further comprises dedicated memory and wherein the each of the plurality of
media processors can employ any unused portion of the dedicated memory of
another media processor in a shared manner to efficiently store and retrieve
presentation, transmission and storage media information at substantially peak data
rates.

53. A parallel multi-processor system that maintains substantially peak
data throughput in the unified execution of multiple media streams, the system 
having a data path, comprising: 

at least one high bandwidth external interface, the at least one high 
bandwidth external interface coupled to the data path and operable to receive a
plurality of data of various sizes from an external source and communicate the 
received data at a rate that maintains substantially peak operation of the parallel 
multi-processor system; 

a plurality of register files, each register file having at least one
general purpose register coupled to the data path and operable to store a working
set of media data; and 

at least one multi-precision execution unit coupled to the data path,
the at least one multi-precision execution unit dynamically configurable to partition
data within a working set of media data received from the data path to account for
the elemental symbol size of the plurality of media streams, and programmable to 
operate in parallel on working sets of data stored in the plurality of register files to 
generate a unified symbol output for each register file. 

54. The parallel multi-processor system defined in claim 53, wherein
the at least one execution unit alternates in a round robin manner to operate on
data stored in the plurality of register files. 

55. The parallel multi-processor system defined in claim 53, further
comprising an instruction pre-fetch pipeline.

56. The parallel multi-processor system defined in claim 55, wherein
the instruction pre-fetch pipeline comprises a super-string pipeline. 

57. The parallel multi-processor system defined in claim 55, wherein
the instruction pre-fetch pipeline comprises a super-spring pipeline.



58. The parallel multi-processor system defined in claim 53, further
comprising a data pre-fetch pipeline.

59. The parallel multi-processor system defined in claim 58, wherein
the data pre-fetch pipeline comprises a super-string pipeline.

60. The parallel multi-processor system defined in claim 58, wherein
the data pre-fetch pipeline comprises a super-spring pipeline. 

61. The parallel multi-processor system defined in claim 53, further
comprising a requester, responder and transponder daemon. 

62. A method for processing unified streams of media data, comprising
the steps of:

receiving a stream of unified media data including presentation,
transmission and storage information;

dynamically partitioning the unified stream of media data into
component fields of at least one bit based on the elemental symbol size of data
received; and

processing the unified stream of media data at substantially peak
operation.

63. The method defined in claim 62, wherein the step of processing the
unified stream of media data comprises the steps of:

storing the stream of unified media data in a general register file;

performing multi-precision arithmetic operations on the stored
stream of unified media data based on programmed instructions, the multi-
precision arithmetic operations including Boolean, integer and floating point
mathematical operations;

manipulating the component fields of unified media data based on
programmed instructions that implement copying, shifting and re-sizing operations;
and

performing multi-precision mathematical operations on the stored
stream of unified media data based on programmed instructions, the mathematical
operations including finite group, finite field, finite ring and table look-up
operations.

64. The method defined in claim 63, further comprising the steps of:

pre-fetching instructions and data to fill instruction and data
pipelines;

performing memory management operations to retrieve instructions
and data from external memory;

storing instructions and data in instruction and data cache/buffers;
and

dynamically allocating buffer storage in the instruction and data
cache/buffers to ensure real-time execution.

65. The method defined in claim 63, further comprising the step of
providing a set of instructions to process the stream of unified media data, the set
of instructions including load, store, synchronization, branch and gateway
instructions.

66. The method defined in claim 65, further comprising the step of
programming a sequence of at least one instruction from the set of instructions.

67. A method for achieving high bandwidth communications between a 
general purpose media processor and external devices, comprising the steps of: 

providing a high bandwidth interface disposed between the media
processor and the external devices, the high bandwidth interface comprising at
least one unidirectional channel pair having an input port and an output port; and

transmitting and receiving a plurality of media data streams,
comprising component fields of various sizes between the media processor and the
external devices at a rate that sustains substantially peak data throughput at the
media processor.

68. The method defined in claim 67, wherein the step of providing a
high bandwidth interface further comprises providing a plurality of external
devices, the plurality of external devices coupled in series on the at least one uni-
directional channel pair.



69. The method defined in claim 67, wherein the step of providing a
high bandwidth interface further comprises providing a plurality of parallel uni-
directional channel pairs.



70. A method for processing streams of media data, comprising the
steps of:

providing a bi-directional communications fabric for transmitting
and receiving at least one stream of unified media data, the at least one stream of
unified media data comprising presentation, transmission and storage information;
and

providing at least one programmable media processor within the
communications network, the at least one programmable media processor for
receiving, processing and transmitting the at least one stream of unified media data
over the bi-directional communications fabric.